Review Mode (Core) (#3401)

## 📝 Review Mode -- Core

This PR introduces the Core implementation for Review mode:

- New op `Op::Review { prompt: String }:` spawns a child review task
with isolated context, a review‑specific system prompt, and a
`Config.review_model`.
- `EnteredReviewMode`: emitted when the child review session starts.
Every event from this point onwards reflects the review session.
- `ExitedReviewMode(Option<ReviewOutputEvent>)`: emitted when the review
finishes or is interrupted, with optional structured findings:

```json
{
  "findings": [
    {
      "title": "<≤ 80 chars, imperative>",
      "body": "<valid Markdown explaining *why* this is a problem; cite files/lines/functions>",
      "confidence_score": <float 0.0-1.0>,
      "priority": <int 0-3>,
      "code_location": {
        "absolute_file_path": "<file path>",
        "line_range": {"start": <int>, "end": <int>}
      }
    }
  ],
  "overall_correctness": "patch is correct" | "patch is incorrect",
  "overall_explanation": "<1-3 sentence explanation justifying the overall_correctness verdict>",
  "overall_confidence_score": <float 0.0-1.0>
}
```

## Questions

### Why separate out its own message history?

We want the review thread to match the training of our review models as
much as possible -- that means using a custom prompt, removing user
instructions, and starting a clean chat history.

We also want to make sure the review thread doesn't leak into the parent
thread.

### Why do this as a mode, vs. sub-agents?

1. We want review to be a synchronous task, so it's fine for now to do a
bespoke implementation.
2. We're still unclear about the final structure for sub-agents. We'd
prefer to land this quickly and then refactor into sub-agents without
rushing that implementation.
This commit is contained in:
dedrisian-oai
2025-09-12 16:25:10 -07:00
committed by GitHub
parent 8d56d2f655
commit 90a0fd342f
15 changed files with 976 additions and 19 deletions

View File

@@ -0,0 +1,87 @@
# Review guidelines:
You are acting as a reviewer for a proposed code change made by another engineer.
Below are some default guidelines for determining whether the original author would appreciate the issue being flagged.
These are not the final word in determining whether an issue is a bug. In many cases, you will encounter other, more specific guidelines. These may be present elsewhere in a developer message, a user message, a file, or even elsewhere in this system message.
Those guidelines should be considered to override these general instructions.
Here are the general guidelines for determining whether something is a bug and should be flagged.
1. It meaningfully impacts the accuracy, performance, security, or maintainability of the code.
2. The bug is discrete and actionable (i.e. not a general issue with the codebase or a combination of multiple issues).
3. Fixing the bug does not demand a level of rigor that is not present in the rest of the codebase (e.g. one doesn't need very detailed comments and input validation in a repository of one-off scripts in personal projects)
4. The bug was introduced in the commit (pre-existing bugs should not be flagged).
5. The author of the original PR would likely fix the issue if they were made aware of it.
6. The bug does not rely on unstated assumptions about the codebase or author's intent.
7. It is not enough to speculate that a change may disrupt another part of the codebase, to be considered a bug, one must identify the other parts of the code that are provably affected.
8. The bug is clearly not just an intentional change by the original author.
When flagging a bug, you will also provide an accompanying comment. Once again, these guidelines are not the final word on how to construct a comment -- defer to any subsequent guidelines that you encounter.
1. The comment should be clear about why the issue is a bug.
2. The comment should appropriately communicate the severity of the issue. It should not claim that an issue is more severe than it actually is.
3. The comment should be brief. The body should be at most 1 paragraph. It should not introduce line breaks within the natural language flow unless it is necessary for the code fragment.
4. The comment should not include any chunks of code longer than 3 lines. Any code chunks should be wrapped in markdown inline code tags or a code block.
5. The comment should clearly and explicitly communicate the scenarios, environments, or inputs that are necessary for the bug to arise. The comment should immediately indicate that the issue's severity depends on these factors.
6. The comment's tone should be matter-of-fact and not accusatory or overly positive. It should read as a helpful AI assistant suggestion without sounding too much like a human reviewer.
7. The comment should be written such that the original author can immediately grasp the idea without close reading.
8. The comment should avoid excessive flattery and comments that are not helpful to the original author. The comment should avoid phrasing like "Great job ...", "Thanks for ...".
Below are some more detailed guidelines that you should apply to this specific review.
HOW MANY FINDINGS TO RETURN:
Output all findings that the original author would fix if they knew about it. If there is no finding that a person would definitely love to see and fix, prefer outputting no findings. Do not stop at the first qualifying finding. Continue until you've listed every qualifying finding.
GUIDELINES:
- Ignore trivial style unless it obscures meaning or violates documented standards.
- Use one comment per distinct issue (or a multi-line range if necessary).
- Use ```suggestion blocks ONLY for concrete replacement code (minimal lines; no commentary inside the block).
- In every ```suggestion block, preserve the exact leading whitespace of the replaced lines (spaces vs tabs, number of spaces).
- Do NOT introduce or remove outer indentation levels unless that is the actual fix.
The comments will be presented in the code review as inline comments. You should avoid providing unnecessary location details in the comment body. Always keep the line range as short as possible for interpreting the issue. Avoid ranges longer than 510 lines; instead, choose the most suitable subrange that pinpoints the problem.
At the beginning of the finding title, tag the bug with priority level. For example "[P1] Un-padding slices along wrong tensor dimensions". [P0] Drop everything to fix. Blocking release, operations, or major usage. Only use for universal issues that do not depend on any assumptions about the inputs. · [P1] Urgent. Should be addressed in the next cycle · [P2] Normal. To be fixed eventually · [P3] Low. Nice to have.
Additionally, include a numeric priority field in the JSON output for each finding: set "priority" to 0 for P0, 1 for P1, 2 for P2, or 3 for P3. If a priority cannot be determined, omit the field or use null.
At the end of your findings, output an "overall correctness" verdict of whether or not the patch should be considered "correct".
Correct implies that existing code and tests will not break, and the patch is free of bugs and other blocking issues.
Ignore non-blocking issues such as style, formatting, typos, documentation, and other nits.
FORMATTING GUIDELINES:
The finding description should be one paragraph.
OUTPUT FORMAT:
## Output schema — MUST MATCH *exactly*
```json
{
"findings": [
{
"title": "<≤ 80 chars, imperative>",
"body": "<valid Markdown explaining *why* this is a problem; cite files/lines/functions>",
"confidence_score": <float 0.0-1.0>,
"priority": <int 0-3, optional>,
"code_location": {
"absolute_file_path": "<file path>",
"line_range": {"start": <int>, "end": <int>}
}
}
],
"overall_correctness": "patch is correct" | "patch is incorrect",
"overall_explanation": "<1-3 sentence explanation justifying the overall_correctness verdict>",
"overall_confidence_score": <float 0.0-1.0>
}
```
* **Do not** wrap the JSON in markdown fences or extra prose.
* The code_location field is required and must include absolute_file_path and line_range.
*Line ranges must be as short as possible for interpreting the issue (avoid ranges over 510 lines; pick the most suitable subrange).
* The code_location should overlap with the diff.
* Do not generate a PR fix.

View File

@@ -19,6 +19,9 @@ use tokio::sync::mpsc;
/// with this content.
const BASE_INSTRUCTIONS: &str = include_str!("../prompt.md");
/// Review thread system prompt. Edit `core/src/review_prompt.md` to customize.
pub const REVIEW_PROMPT: &str = include_str!("../review_prompt.md");
/// API request payload for a single model turn
#[derive(Default, Debug, Clone)]
pub struct Prompt {

View File

@@ -9,6 +9,7 @@ use std::sync::atomic::AtomicU64;
use std::time::Duration;
use crate::AuthManager;
use crate::client_common::REVIEW_PROMPT;
use crate::event_mapping::map_response_item_to_event_messages;
use async_channel::Receiver;
use async_channel::Sender;
@@ -17,6 +18,7 @@ use codex_apply_patch::MaybeApplyPatchVerified;
use codex_apply_patch::maybe_parse_apply_patch_verified;
use codex_protocol::mcp_protocol::ConversationId;
use codex_protocol::protocol::ConversationPathResponseEvent;
use codex_protocol::protocol::ReviewRequest;
use codex_protocol::protocol::RolloutItem;
use codex_protocol::protocol::TaskStartedEvent;
use codex_protocol::protocol::TurnAbortReason;
@@ -95,6 +97,7 @@ use crate::protocol::Op;
use crate::protocol::PatchApplyBeginEvent;
use crate::protocol::PatchApplyEndEvent;
use crate::protocol::ReviewDecision;
use crate::protocol::ReviewOutputEvent;
use crate::protocol::SandboxPolicy;
use crate::protocol::SessionConfiguredEvent;
use crate::protocol::StreamErrorEvent;
@@ -307,6 +310,7 @@ pub(crate) struct TurnContext {
pub(crate) sandbox_policy: SandboxPolicy,
pub(crate) shell_environment_policy: ShellEnvironmentPolicy,
pub(crate) tools_config: ToolsConfig,
pub(crate) is_review_mode: bool,
}
impl TurnContext {
@@ -478,6 +482,7 @@ impl Session {
sandbox_policy,
shell_environment_policy: config.shell_environment_policy.clone(),
cwd,
is_review_mode: false,
};
let sess = Arc::new(Session {
conversation_id,
@@ -1032,11 +1037,19 @@ pub(crate) struct ApplyPatchCommandContext {
pub(crate) changes: HashMap<PathBuf, FileChange>,
}
#[derive(Clone, Debug, Eq, PartialEq)]
enum AgentTaskKind {
Regular,
Review,
Compact,
}
/// A series of Turns in response to user input.
pub(crate) struct AgentTask {
sess: Arc<Session>,
sub_id: String,
handle: AbortHandle,
kind: AgentTaskKind,
}
impl AgentTask {
@@ -1056,6 +1069,27 @@ impl AgentTask {
sess,
sub_id,
handle,
kind: AgentTaskKind::Regular,
}
}
fn review(
sess: Arc<Session>,
turn_context: Arc<TurnContext>,
sub_id: String,
input: Vec<InputItem>,
) -> Self {
let handle = {
let sess = sess.clone();
let sub_id = sub_id.clone();
let tc = Arc::clone(&turn_context);
tokio::spawn(async move { run_task(sess, tc, sub_id, input).await }).abort_handle()
};
Self {
sess,
sub_id,
handle,
kind: AgentTaskKind::Review,
}
}
@@ -1079,12 +1113,20 @@ impl AgentTask {
sess,
sub_id,
handle,
kind: AgentTaskKind::Compact,
}
}
fn abort(self, reason: TurnAbortReason) {
// TOCTOU?
if !self.handle.is_finished() {
if self.kind == AgentTaskKind::Review {
let sess = self.sess.clone();
let sub_id = self.sub_id.clone();
tokio::spawn(async move {
exit_review_mode(sess, sub_id, None).await;
});
}
self.handle.abort();
let event = Event {
id: self.sub_id,
@@ -1184,6 +1226,7 @@ async fn submission_loop(
sandbox_policy: new_sandbox_policy.clone(),
shell_environment_policy: prev.shell_environment_policy.clone(),
cwd: new_cwd.clone(),
is_review_mode: false,
};
// Install the new persistent context for subsequent tasks/turns.
@@ -1269,6 +1312,7 @@ async fn submission_loop(
sandbox_policy,
shell_environment_policy: turn_context.shell_environment_policy.clone(),
cwd,
is_review_mode: false,
};
// TODO: record the new environment context in the conversation history
// no current task, spawn a new one with the perturn context
@@ -1430,6 +1474,16 @@ async fn submission_loop(
};
sess.send_event(event).await;
}
Op::Review { review_request } => {
spawn_review_thread(
sess.clone(),
config.clone(),
turn_context.clone(),
sub.id,
review_request,
)
.await;
}
_ => {
// Ignore unknown ops; enum is non_exhaustive to allow extensions.
}
@@ -1438,6 +1492,82 @@ async fn submission_loop(
debug!("Agent loop exited");
}
/// Spawn a review thread using the given prompt.
async fn spawn_review_thread(
sess: Arc<Session>,
config: Arc<Config>,
parent_turn_context: Arc<TurnContext>,
sub_id: String,
review_request: ReviewRequest,
) {
let model = config.review_model.clone();
let review_model_family = find_family_for_model(&model)
.unwrap_or_else(|| parent_turn_context.client.get_model_family());
let tools_config = ToolsConfig::new(&ToolsConfigParams {
model_family: &review_model_family,
approval_policy: parent_turn_context.approval_policy,
sandbox_policy: parent_turn_context.sandbox_policy.clone(),
include_plan_tool: false,
include_apply_patch_tool: config.include_apply_patch_tool,
include_web_search_request: false,
use_streamable_shell_tool: false,
include_view_image_tool: false,
experimental_unified_exec_tool: config.use_experimental_unified_exec_tool,
});
let base_instructions = Some(REVIEW_PROMPT.to_string());
let provider = parent_turn_context.client.get_provider();
let auth_manager = parent_turn_context.client.get_auth_manager();
let model_family = review_model_family.clone();
// Build perturn client with the requested model/family.
let mut per_turn_config = (*config).clone();
per_turn_config.model = model.clone();
per_turn_config.model_family = model_family.clone();
if let Some(model_info) = get_model_info(&model_family) {
per_turn_config.model_context_window = Some(model_info.context_window);
}
let client = ModelClient::new(
Arc::new(per_turn_config),
auth_manager,
provider,
parent_turn_context.client.get_reasoning_effort(),
parent_turn_context.client.get_reasoning_summary(),
sess.conversation_id,
);
let review_turn_context = TurnContext {
client,
tools_config,
user_instructions: None,
base_instructions,
approval_policy: parent_turn_context.approval_policy,
sandbox_policy: parent_turn_context.sandbox_policy.clone(),
shell_environment_policy: parent_turn_context.shell_environment_policy.clone(),
cwd: parent_turn_context.cwd.clone(),
is_review_mode: true,
};
// Seed the child task with the review prompt as the initial user message.
let input: Vec<InputItem> = vec![InputItem::Text {
text: review_request.prompt.clone(),
}];
let tc = Arc::new(review_turn_context);
// Clone sub_id for the upcoming announcement before moving it into the task.
let sub_id_for_event = sub_id.clone();
let task = AgentTask::review(sess.clone(), tc.clone(), sub_id, input);
sess.set_task(task);
// Announce entering review mode so UIs can switch modes.
sess.send_event(Event {
id: sub_id_for_event,
msg: EventMsg::EnteredReviewMode(review_request),
})
.await;
}
/// Takes a user message as input and runs a loop where, at each turn, the model
/// replies with either:
///
@@ -1451,6 +1581,10 @@ async fn submission_loop(
/// back to the model in the next turn.
/// - If the model sends only an assistant message, we record it in the
/// conversation history and consider the task complete.
///
/// Review mode: when `turn_context.is_review_mode` is true, the turn runs in an
/// isolated in-memory thread without the parent session's prior history or
/// user_instructions. Emits ExitedReviewMode upon final review message.
async fn run_task(
sess: Arc<Session>,
turn_context: Arc<TurnContext>,
@@ -1469,8 +1603,17 @@ async fn run_task(
sess.send_event(event).await;
let initial_input_for_turn: ResponseInputItem = ResponseInputItem::from(input);
sess.record_input_and_rollout_usermsg(&initial_input_for_turn)
.await;
// For review threads, keep an isolated in-memory history so the
// model sees a fresh conversation without the parent session's history.
// For normal turns, continue recording to the session history as before.
let is_review_mode = turn_context.is_review_mode;
let mut review_thread_history: Vec<ResponseItem> = Vec::new();
if is_review_mode {
review_thread_history.push(initial_input_for_turn.into());
} else {
sess.record_input_and_rollout_usermsg(&initial_input_for_turn)
.await;
}
let mut last_agent_message: Option<String> = None;
// Although from the perspective of codex.rs, TurnDiffTracker has the lifecycle of a Task which contains
@@ -1487,14 +1630,26 @@ async fn run_task(
.into_iter()
.map(ResponseItem::from)
.collect::<Vec<ResponseItem>>();
sess.record_conversation_items(&pending_input).await;
// Construct the input that we will send to the model. When using the
// Chat completions API (or ZDR clients), the model needs the full
// conversation history on each turn. The rollout file, however, should
// only record the new items that originated in this turn so that it
// represents an append-only log without duplicates.
let turn_input: Vec<ResponseItem> = sess.turn_input_with_history(pending_input);
// Construct the input that we will send to the model.
//
// - For review threads, use the isolated in-memory history so the
// model sees a fresh conversation (no parent history/user_instructions).
//
// - For normal turns, use the session's full history. When using the
// chat completions API (or ZDR clients), the model needs the full
// conversation history on each turn. The rollout file, however, should
// only record the new items that originated in this turn so that it
// represents an append-only log without duplicates.
let turn_input: Vec<ResponseItem> = if is_review_mode {
if !pending_input.is_empty() {
review_thread_history.extend(pending_input);
}
review_thread_history.clone()
} else {
sess.record_conversation_items(&pending_input).await;
sess.turn_input_with_history(pending_input)
};
let turn_input_messages: Vec<String> = turn_input
.iter()
@@ -1628,8 +1783,13 @@ async fn run_task(
// Only attempt to take the lock if there is something to record.
if !items_to_record_in_conversation_history.is_empty() {
sess.record_conversation_items(&items_to_record_in_conversation_history)
.await;
if is_review_mode {
review_thread_history
.extend(items_to_record_in_conversation_history.clone());
} else {
sess.record_conversation_items(&items_to_record_in_conversation_history)
.await;
}
}
if token_limit_reached {
@@ -1683,6 +1843,23 @@ async fn run_task(
}
}
}
// If this was a review thread and we have a final assistant message,
// try to parse it as a ReviewOutput.
//
// If parsing fails, construct a minimal ReviewOutputEvent using the plain
// text as the overall explanation. Else, just exit review mode with None.
//
// Emits an ExitedReviewMode event with the parsed review output.
if turn_context.is_review_mode {
exit_review_mode(
sess.clone(),
sub_id.clone(),
last_agent_message.as_deref().map(parse_review_output_event),
)
.await;
}
sess.remove_task(&sub_id);
let event = Event {
id: sub_id,
@@ -1691,6 +1868,31 @@ async fn run_task(
sess.send_event(event).await;
}
/// Parse the review output; when not valid JSON, build a structured
/// fallback that carries the plain text as the overall explanation.
///
/// Returns: a ReviewOutputEvent parsed from JSON or a fallback populated from text.
fn parse_review_output_event(text: &str) -> ReviewOutputEvent {
// Try direct parse first
if let Ok(ev) = serde_json::from_str::<ReviewOutputEvent>(text) {
return ev;
}
// If wrapped in markdown fences or extra prose, attempt to extract the first JSON object
if let (Some(start), Some(end)) = (text.find('{'), text.rfind('}'))
&& start < end
&& let Some(slice) = text.get(start..=end)
&& let Ok(ev) = serde_json::from_str::<ReviewOutputEvent>(slice)
{
return ev;
}
// Not JSON return a structured ReviewOutputEvent that carries
// the plain text as the overall explanation.
ReviewOutputEvent {
overall_explanation: text.to_string(),
..Default::default()
}
}
async fn run_turn(
sess: &Session,
turn_context: &TurnContext,
@@ -1917,11 +2119,17 @@ async fn try_run_turn(
return Ok(result);
}
ResponseEvent::OutputTextDelta(delta) => {
let event = Event {
id: sub_id.to_string(),
msg: EventMsg::AgentMessageDelta(AgentMessageDeltaEvent { delta }),
};
sess.send_event(event).await;
// In review child threads, suppress assistant text deltas; the
// UI will show a selection popup from the final ReviewOutput.
if !turn_context.is_review_mode {
let event = Event {
id: sub_id.to_string(),
msg: EventMsg::AgentMessageDelta(AgentMessageDeltaEvent { delta }),
};
sess.send_event(event).await;
} else {
trace!("suppressing OutputTextDelta in review mode");
}
}
ResponseEvent::ReasoningSummaryDelta(delta) => {
let event = Event {
@@ -2053,7 +2261,15 @@ async fn handle_response_item(
ResponseItem::Message { .. }
| ResponseItem::Reasoning { .. }
| ResponseItem::WebSearchCall { .. } => {
let msgs = map_response_item_to_event_messages(&item, sess.show_raw_agent_reasoning);
// In review child threads, suppress assistant message events but
// keep reasoning/web search.
let msgs = match &item {
ResponseItem::Message { .. } if turn_context.is_review_mode => {
trace!("suppressing assistant Message in review mode");
Vec::new()
}
_ => map_response_item_to_event_messages(&item, sess.show_raw_agent_reasoning),
};
for msg in msgs {
let event = Event {
id: sub_id.to_string(),
@@ -2988,6 +3204,19 @@ fn convert_call_tool_result_to_function_call_output_payload(
}
}
/// Emits an ExitedReviewMode Event with optional ReviewOutput.
async fn exit_review_mode(
session: Arc<Session>,
task_sub_id: String,
res: Option<ReviewOutputEvent>,
) {
let event = Event {
id: task_sub_id,
msg: EventMsg::ExitedReviewMode(res),
};
session.send_event(event).await;
}
#[cfg(test)]
mod tests {
use super::*;

View File

@@ -32,6 +32,7 @@ use toml::Value as TomlValue;
use toml_edit::DocumentMut;
const OPENAI_DEFAULT_MODEL: &str = "gpt-5";
const OPENAI_DEFAULT_REVIEW_MODEL: &str = "gpt-5";
pub const GPT5_HIGH_MODEL: &str = "gpt-5-high-new";
/// Maximum number of bytes of the documentation that will be embedded. Larger
@@ -47,6 +48,9 @@ pub struct Config {
/// Optional override of model selection.
pub model: String,
/// Model used specifically for review sessions. Defaults to "gpt-5".
pub review_model: String,
pub model_family: ModelFamily,
/// Size of the context window for the model, in tokens.
@@ -512,6 +516,8 @@ fn apply_toml_override(root: &mut TomlValue, path: &str, value: TomlValue) {
pub struct ConfigToml {
/// Optional override of model selection.
pub model: Option<String>,
/// Review model override used by the `/review` feature.
pub review_model: Option<String>,
/// Provider to use from the model_providers map.
pub model_provider: Option<String>,
@@ -740,6 +746,7 @@ impl ConfigToml {
#[derive(Default, Debug, Clone)]
pub struct ConfigOverrides {
pub model: Option<String>,
pub review_model: Option<String>,
pub cwd: Option<PathBuf>,
pub approval_policy: Option<AskForApproval>,
pub sandbox_mode: Option<SandboxMode>,
@@ -767,6 +774,7 @@ impl Config {
// Destructure ConfigOverrides fully to ensure all overrides are applied.
let ConfigOverrides {
model,
review_model: override_review_model,
cwd,
approval_policy,
sandbox_mode,
@@ -902,8 +910,14 @@ impl Config {
Self::get_base_instructions(experimental_instructions_path, &resolved_cwd)?;
let base_instructions = base_instructions.or(file_base_instructions);
// Default review model when not set in config; allow CLI override to take precedence.
let review_model = override_review_model
.or(cfg.review_model)
.unwrap_or_else(default_review_model);
let config = Self {
model,
review_model,
model_family,
model_context_window,
model_max_output_tokens,
@@ -1027,6 +1041,10 @@ fn default_model() -> String {
OPENAI_DEFAULT_MODEL.to_string()
}
fn default_review_model() -> String {
OPENAI_DEFAULT_REVIEW_MODEL.to_string()
}
/// Returns the path to the Codex configuration directory, which can be
/// specified by the `CODEX_HOME` environment variable. If not set, defaults to
/// `~/.codex`.
@@ -1439,6 +1457,7 @@ model_verbosity = "high"
assert_eq!(
Config {
model: "o3".to_string(),
review_model: "gpt-5".to_string(),
model_family: find_family_for_model("o3").expect("known model slug"),
model_context_window: Some(200_000),
model_max_output_tokens: Some(100_000),
@@ -1496,6 +1515,7 @@ model_verbosity = "high"
)?;
let expected_gpt3_profile_config = Config {
model: "gpt-3.5-turbo".to_string(),
review_model: "gpt-5".to_string(),
model_family: find_family_for_model("gpt-3.5-turbo").expect("known model slug"),
model_context_window: Some(16_385),
model_max_output_tokens: Some(4_096),
@@ -1568,6 +1588,7 @@ model_verbosity = "high"
)?;
let expected_zdr_profile_config = Config {
model: "o3".to_string(),
review_model: "gpt-5".to_string(),
model_family: find_family_for_model("o3").expect("known model slug"),
model_context_window: Some(200_000),
model_max_output_tokens: Some(100_000),
@@ -1626,6 +1647,7 @@ model_verbosity = "high"
)?;
let expected_gpt5_profile_config = Config {
model: "gpt-5".to_string(),
review_model: "gpt-5".to_string(),
model_family: find_family_for_model("gpt-5").expect("known model slug"),
model_context_window: Some(272_000),
model_max_output_tokens: Some(128_000),

View File

@@ -38,7 +38,9 @@ pub(crate) fn should_persist_event_msg(ev: &EventMsg) -> bool {
| EventMsg::AgentMessage(_)
| EventMsg::AgentReasoning(_)
| EventMsg::AgentReasoningRawContent(_)
| EventMsg::TokenCount(_) => true,
| EventMsg::TokenCount(_)
| EventMsg::EnteredReviewMode(_)
| EventMsg::ExitedReviewMode(_) => true,
EventMsg::Error(_)
| EventMsg::TaskStarted(_)
| EventMsg::TaskComplete(_)

View File

@@ -9,6 +9,7 @@ mod fork_conversation;
mod live_cli;
mod model_overrides;
mod prompt_caching;
mod review;
mod seatbelt;
mod stream_error_allows_next_turn;
mod stream_no_completed;

View File

@@ -0,0 +1,542 @@
use codex_core::CodexAuth;
use codex_core::CodexConversation;
use codex_core::ConversationManager;
use codex_core::ModelProviderInfo;
use codex_core::built_in_model_providers;
use codex_core::config::Config;
use codex_core::protocol::EventMsg;
use codex_core::protocol::InputItem;
use codex_core::protocol::Op;
use codex_core::protocol::ReviewCodeLocation;
use codex_core::protocol::ReviewFinding;
use codex_core::protocol::ReviewLineRange;
use codex_core::protocol::ReviewOutputEvent;
use codex_core::protocol::ReviewRequest;
use codex_core::spawn::CODEX_SANDBOX_NETWORK_DISABLED_ENV_VAR;
use core_test_support::load_default_config_for_test;
use core_test_support::load_sse_fixture_with_id_from_str;
use core_test_support::wait_for_event;
use pretty_assertions::assert_eq;
use std::path::PathBuf;
use std::sync::Arc;
use tempfile::TempDir;
use tokio::io::AsyncWriteExt as _;
use uuid::Uuid;
use wiremock::Mock;
use wiremock::MockServer;
use wiremock::ResponseTemplate;
use wiremock::matchers::method;
use wiremock::matchers::path;
/// Verify that submitting `Op::Review` spawns a child task and emits
/// EnteredReviewMode -> ExitedReviewMode(None) -> TaskComplete
/// in that order when the model returns a structured review JSON payload.
#[tokio::test(flavor = "multi_thread", worker_threads = 2)]
async fn review_op_emits_lifecycle_and_review_output() {
// Skip under Codex sandbox network restrictions.
if std::env::var(CODEX_SANDBOX_NETWORK_DISABLED_ENV_VAR).is_ok() {
println!(
"Skipping test because it cannot execute when network is disabled in a Codex sandbox."
);
return;
}
// Start mock Responses API server. Return a single assistant message whose
// text is a JSON-encoded ReviewOutputEvent.
let review_json = serde_json::json!({
"findings": [
{
"title": "Prefer Stylize helpers",
"body": "Use .dim()/.bold() chaining instead of manual Style where possible.",
"confidence_score": 0.9,
"priority": 1,
"code_location": {
"absolute_file_path": "/tmp/file.rs",
"line_range": {"start": 10, "end": 20}
}
}
],
"overall_correctness": "good",
"overall_explanation": "All good with some improvements suggested.",
"overall_confidence_score": 0.8
})
.to_string();
let sse_template = r#"[
{"type":"response.output_item.done", "item":{
"type":"message", "role":"assistant",
"content":[{"type":"output_text","text":__REVIEW__}]
}},
{"type":"response.completed", "response": {"id": "__ID__"}}
]"#;
let review_json_escaped = serde_json::to_string(&review_json).unwrap();
let sse_raw = sse_template.replace("__REVIEW__", &review_json_escaped);
let server = start_responses_server_with_sse(&sse_raw, 1).await;
let codex_home = TempDir::new().unwrap();
let codex = new_conversation_for_server(&server, &codex_home, |_| {}).await;
// Submit review request.
codex
.submit(Op::Review {
review_request: ReviewRequest {
prompt: "Please review my changes".to_string(),
user_facing_hint: "my changes".to_string(),
},
})
.await
.unwrap();
// Verify lifecycle: Entered -> Exited(Some(review)) -> TaskComplete.
let _entered = wait_for_event(&codex, |ev| matches!(ev, EventMsg::EnteredReviewMode(_))).await;
let closed = wait_for_event(&codex, |ev| matches!(ev, EventMsg::ExitedReviewMode(_))).await;
let review = match closed {
EventMsg::ExitedReviewMode(Some(r)) => r,
other => panic!("expected ExitedReviewMode(Some(..)), got {other:?}"),
};
// Deep compare full structure using PartialEq (floats are f32 on both sides).
let expected = ReviewOutputEvent {
findings: vec![ReviewFinding {
title: "Prefer Stylize helpers".to_string(),
body: "Use .dim()/.bold() chaining instead of manual Style where possible.".to_string(),
confidence_score: 0.9,
priority: 1,
code_location: ReviewCodeLocation {
absolute_file_path: PathBuf::from("/tmp/file.rs"),
line_range: ReviewLineRange { start: 10, end: 20 },
},
}],
overall_correctness: "good".to_string(),
overall_explanation: "All good with some improvements suggested.".to_string(),
overall_confidence_score: 0.8,
};
assert_eq!(expected, review);
let _complete = wait_for_event(&codex, |ev| matches!(ev, EventMsg::TaskComplete(_))).await;
server.verify().await;
}
/// When the model returns plain text that is not JSON, ensure the child
/// lifecycle still occurs and the plain text is surfaced via
/// ExitedReviewMode(Some(..)) as the overall_explanation.
#[tokio::test(flavor = "multi_thread", worker_threads = 2)]
async fn review_op_with_plain_text_emits_review_fallback() {
if std::env::var(CODEX_SANDBOX_NETWORK_DISABLED_ENV_VAR).is_ok() {
println!(
"Skipping test because it cannot execute when network is disabled in a Codex sandbox."
);
return;
}
let sse_raw = r#"[
{"type":"response.output_item.done", "item":{
"type":"message", "role":"assistant",
"content":[{"type":"output_text","text":"just plain text"}]
}},
{"type":"response.completed", "response": {"id": "__ID__"}}
]"#;
let server = start_responses_server_with_sse(sse_raw, 1).await;
let codex_home = TempDir::new().unwrap();
let codex = new_conversation_for_server(&server, &codex_home, |_| {}).await;
codex
.submit(Op::Review {
review_request: ReviewRequest {
prompt: "Plain text review".to_string(),
user_facing_hint: "plain text review".to_string(),
},
})
.await
.unwrap();
let _entered = wait_for_event(&codex, |ev| matches!(ev, EventMsg::EnteredReviewMode(_))).await;
let closed = wait_for_event(&codex, |ev| matches!(ev, EventMsg::ExitedReviewMode(_))).await;
let review = match closed {
EventMsg::ExitedReviewMode(Some(r)) => r,
other => panic!("expected ExitedReviewMode(Some(..)), got {other:?}"),
};
// Expect a structured fallback carrying the plain text.
let expected = ReviewOutputEvent {
overall_explanation: "just plain text".to_string(),
..Default::default()
};
assert_eq!(expected, review);
let _complete = wait_for_event(&codex, |ev| matches!(ev, EventMsg::TaskComplete(_))).await;
server.verify().await;
}
/// When the model returns structured JSON in a review, ensure no AgentMessage
/// is emitted; the UI consumes the structured result via ExitedReviewMode.
#[tokio::test(flavor = "multi_thread", worker_threads = 2)]
async fn review_does_not_emit_agent_message_on_structured_output() {
if std::env::var(CODEX_SANDBOX_NETWORK_DISABLED_ENV_VAR).is_ok() {
println!(
"Skipping test because it cannot execute when network is disabled in a Codex sandbox."
);
return;
}
let review_json = serde_json::json!({
"findings": [
{
"title": "Example",
"body": "Structured review output.",
"confidence_score": 0.5,
"priority": 1,
"code_location": {
"absolute_file_path": "/tmp/file.rs",
"line_range": {"start": 1, "end": 2}
}
}
],
"overall_correctness": "ok",
"overall_explanation": "ok",
"overall_confidence_score": 0.5
})
.to_string();
let sse_template = r#"[
{"type":"response.output_item.done", "item":{
"type":"message", "role":"assistant",
"content":[{"type":"output_text","text":__REVIEW__}]
}},
{"type":"response.completed", "response": {"id": "__ID__"}}
]"#;
let review_json_escaped = serde_json::to_string(&review_json).unwrap();
let sse_raw = sse_template.replace("__REVIEW__", &review_json_escaped);
let server = start_responses_server_with_sse(&sse_raw, 1).await;
let codex_home = TempDir::new().unwrap();
let codex = new_conversation_for_server(&server, &codex_home, |_| {}).await;
codex
.submit(Op::Review {
review_request: ReviewRequest {
prompt: "check structured".to_string(),
user_facing_hint: "check structured".to_string(),
},
})
.await
.unwrap();
// Drain events until TaskComplete; ensure none are AgentMessage.
use tokio::time::Duration;
use tokio::time::timeout;
let mut saw_entered = false;
let mut saw_exited = false;
loop {
let ev = timeout(Duration::from_secs(5), codex.next_event())
.await
.expect("timeout waiting for event")
.expect("stream ended unexpectedly");
match ev.msg {
EventMsg::TaskComplete(_) => break,
EventMsg::AgentMessage(_) => {
panic!("unexpected AgentMessage during review with structured output")
}
EventMsg::EnteredReviewMode(_) => saw_entered = true,
EventMsg::ExitedReviewMode(_) => saw_exited = true,
_ => {}
}
}
assert!(saw_entered && saw_exited, "missing review lifecycle events");
server.verify().await;
}
/// Ensure that when a custom `review_model` is set in the config, the review
/// request uses that model (and not the main chat model).
#[tokio::test(flavor = "multi_thread", worker_threads = 2)]
async fn review_uses_custom_review_model_from_config() {
if std::env::var(CODEX_SANDBOX_NETWORK_DISABLED_ENV_VAR).is_ok() {
println!(
"Skipping test because it cannot execute when network is disabled in a Codex sandbox."
);
return;
}
// Minimal stream: just a completed event
let sse_raw = r#"[
{"type":"response.completed", "response": {"id": "__ID__"}}
]"#;
let server = start_responses_server_with_sse(sse_raw, 1).await;
let codex_home = TempDir::new().unwrap();
// Choose a review model different from the main model; ensure it is used.
let codex = new_conversation_for_server(&server, &codex_home, |cfg| {
cfg.model = "gpt-4.1".to_string();
cfg.review_model = "gpt-5".to_string();
})
.await;
codex
.submit(Op::Review {
review_request: ReviewRequest {
prompt: "use custom model".to_string(),
user_facing_hint: "use custom model".to_string(),
},
})
.await
.unwrap();
// Wait for completion
let _entered = wait_for_event(&codex, |ev| matches!(ev, EventMsg::EnteredReviewMode(_))).await;
let _closed = wait_for_event(&codex, |ev| matches!(ev, EventMsg::ExitedReviewMode(None))).await;
let _complete = wait_for_event(&codex, |ev| matches!(ev, EventMsg::TaskComplete(_))).await;
// Assert the request body model equals the configured review model
let request = &server.received_requests().await.unwrap()[0];
let body = request.body_json::<serde_json::Value>().unwrap();
assert_eq!(body["model"].as_str().unwrap(), "gpt-5");
server.verify().await;
}
/// When a review session begins, it must not prepend prior chat history from
/// the parent session. The request `input` should contain only the review
/// prompt from the user.
#[tokio::test(flavor = "multi_thread", worker_threads = 2)]
async fn review_input_isolated_from_parent_history() {
if std::env::var(CODEX_SANDBOX_NETWORK_DISABLED_ENV_VAR).is_ok() {
println!(
"Skipping test because it cannot execute when network is disabled in a Codex sandbox."
);
return;
}
// Mock server for the single review request
let sse_raw = r#"[
{"type":"response.completed", "response": {"id": "__ID__"}}
]"#;
let server = start_responses_server_with_sse(sse_raw, 1).await;
// Seed a parent session history via resume file with both user + assistant items.
let codex_home = TempDir::new().unwrap();
let mut config = load_default_config_for_test(&codex_home);
config.model_provider = ModelProviderInfo {
base_url: Some(format!("{}/v1", server.uri())),
..built_in_model_providers()["openai"].clone()
};
let session_file = codex_home.path().join("resume.jsonl");
{
let mut f = tokio::fs::File::create(&session_file).await.unwrap();
let convo_id = Uuid::new_v4();
// Proper session_meta line (enveloped) with a conversation id
let meta_line = serde_json::json!({
"timestamp": "2024-01-01T00:00:00.000Z",
"type": "session_meta",
"payload": {
"id": convo_id,
"timestamp": "2024-01-01T00:00:00Z",
"instructions": null,
"cwd": ".",
"originator": "test_originator",
"cli_version": "test_version"
}
});
f.write_all(format!("{meta_line}\n").as_bytes())
.await
.unwrap();
// Prior user message (enveloped response_item)
let user = codex_protocol::models::ResponseItem::Message {
id: None,
role: "user".to_string(),
content: vec![codex_protocol::models::ContentItem::InputText {
text: "parent: earlier user message".to_string(),
}],
};
let user_json = serde_json::to_value(&user).unwrap();
let user_line = serde_json::json!({
"timestamp": "2024-01-01T00:00:01.000Z",
"type": "response_item",
"payload": user_json
});
f.write_all(format!("{user_line}\n").as_bytes())
.await
.unwrap();
// Prior assistant message (enveloped response_item)
let assistant = codex_protocol::models::ResponseItem::Message {
id: None,
role: "assistant".to_string(),
content: vec![codex_protocol::models::ContentItem::OutputText {
text: "parent: assistant reply".to_string(),
}],
};
let assistant_json = serde_json::to_value(&assistant).unwrap();
let assistant_line = serde_json::json!({
"timestamp": "2024-01-01T00:00:02.000Z",
"type": "response_item",
"payload": assistant_json
});
f.write_all(format!("{assistant_line}\n").as_bytes())
.await
.unwrap();
}
config.experimental_resume = Some(session_file);
let codex = new_conversation_for_server(&server, &codex_home, |cfg| {
// apply resume file
cfg.experimental_resume = config.experimental_resume.clone();
})
.await;
// Submit review request; it must start fresh (no parent history in `input`).
let review_prompt = "Please review only this".to_string();
codex
.submit(Op::Review {
review_request: ReviewRequest {
prompt: review_prompt.clone(),
user_facing_hint: review_prompt.clone(),
},
})
.await
.unwrap();
let _entered = wait_for_event(&codex, |ev| matches!(ev, EventMsg::EnteredReviewMode(_))).await;
let _closed = wait_for_event(&codex, |ev| matches!(ev, EventMsg::ExitedReviewMode(None))).await;
let _complete = wait_for_event(&codex, |ev| matches!(ev, EventMsg::TaskComplete(_))).await;
// Assert the request `input` contains only the single review user message.
let request = &server.received_requests().await.unwrap()[0];
let body = request.body_json::<serde_json::Value>().unwrap();
let expected_input = serde_json::json!([
{
"type": "message",
"role": "user",
"content": [{"type": "input_text", "text": review_prompt}]
}
]);
assert_eq!(body["input"], expected_input);
server.verify().await;
}
/// After a review thread finishes, its conversation should not leak into the
/// parent session. A subsequent parent turn must not include any review
/// messages in its request `input`.
#[tokio::test(flavor = "multi_thread", worker_threads = 2)]
async fn review_history_does_not_leak_into_parent_session() {
if std::env::var(CODEX_SANDBOX_NETWORK_DISABLED_ENV_VAR).is_ok() {
println!(
"Skipping test because it cannot execute when network is disabled in a Codex sandbox."
);
return;
}
// Respond to both the review request and the subsequent parent request.
let sse_raw = r#"[
{"type":"response.output_item.done", "item":{
"type":"message", "role":"assistant",
"content":[{"type":"output_text","text":"review assistant output"}]
}},
{"type":"response.completed", "response": {"id": "__ID__"}}
]"#;
let server = start_responses_server_with_sse(sse_raw, 2).await;
let codex_home = TempDir::new().unwrap();
let codex = new_conversation_for_server(&server, &codex_home, |_| {}).await;
// 1) Run a review turn that produces an assistant message (isolated in child).
codex
.submit(Op::Review {
review_request: ReviewRequest {
prompt: "Start a review".to_string(),
user_facing_hint: "Start a review".to_string(),
},
})
.await
.unwrap();
let _entered = wait_for_event(&codex, |ev| matches!(ev, EventMsg::EnteredReviewMode(_))).await;
let _closed = wait_for_event(&codex, |ev| {
matches!(ev, EventMsg::ExitedReviewMode(Some(_)))
})
.await;
let _complete = wait_for_event(&codex, |ev| matches!(ev, EventMsg::TaskComplete(_))).await;
// 2) Continue in the parent session; request input must not include any review items.
let followup = "back to parent".to_string();
codex
.submit(Op::UserInput {
items: vec![InputItem::Text {
text: followup.clone(),
}],
})
.await
.unwrap();
let _complete = wait_for_event(&codex, |ev| matches!(ev, EventMsg::TaskComplete(_))).await;
// Inspect the second request (parent turn) input contents.
// Parent turns include session initial messages (user_instructions, environment_context).
// Critically, no messages from the review thread should appear.
let requests = server.received_requests().await.unwrap();
assert_eq!(requests.len(), 2);
let body = requests[1].body_json::<serde_json::Value>().unwrap();
let input = body["input"].as_array().expect("input array");
// Must include the followup as the last item for this turn
let last = input.last().expect("at least one item in input");
assert_eq!(last["role"].as_str().unwrap(), "user");
let last_text = last["content"][0]["text"].as_str().unwrap();
assert_eq!(last_text, followup);
// Ensure no review-thread content leaked into the parent request
let contains_review_prompt = input
.iter()
.any(|msg| msg["content"][0]["text"].as_str().unwrap_or_default() == "Start a review");
let contains_review_assistant = input.iter().any(|msg| {
msg["content"][0]["text"].as_str().unwrap_or_default() == "review assistant output"
});
assert!(
!contains_review_prompt,
"review prompt leaked into parent turn input"
);
assert!(
!contains_review_assistant,
"review assistant output leaked into parent turn input"
);
server.verify().await;
}
/// Start a mock Responses API server and mount the given SSE stream body.
async fn start_responses_server_with_sse(sse_raw: &str, expected_requests: usize) -> MockServer {
let server = MockServer::start().await;
let sse = load_sse_fixture_with_id_from_str(sse_raw, &Uuid::new_v4().to_string());
Mock::given(method("POST"))
.and(path("/v1/responses"))
.respond_with(
ResponseTemplate::new(200)
.insert_header("content-type", "text/event-stream")
.set_body_raw(sse.clone(), "text/event-stream"),
)
.expect(expected_requests as u64)
.mount(&server)
.await;
server
}
/// Create a conversation configured to talk to the provided mock server.
#[expect(clippy::expect_used)]
async fn new_conversation_for_server<F>(
server: &MockServer,
codex_home: &TempDir,
mutator: F,
) -> Arc<CodexConversation>
where
F: FnOnce(&mut Config),
{
let model_provider = ModelProviderInfo {
base_url: Some(format!("{}/v1", server.uri())),
..built_in_model_providers()["openai"].clone()
};
let mut config = load_default_config_for_test(codex_home);
config.model_provider = model_provider;
mutator(&mut config);
let conversation_manager =
ConversationManager::with_auth(CodexAuth::from_api_key("Test API Key"));
conversation_manager
.new_conversation(config)
.await
.expect("create conversation")
.conversation
}

View File

@@ -562,6 +562,8 @@ impl EventProcessor for EventProcessorWithHumanOutput {
EventMsg::ShutdownComplete => return CodexStatus::Shutdown,
EventMsg::ConversationPath(_) => {}
EventMsg::UserMessage(_) => {}
EventMsg::EnteredReviewMode(_) => {}
EventMsg::ExitedReviewMode(_) => {}
}
CodexStatus::Running
}

View File

@@ -137,6 +137,7 @@ pub async fn run_main(cli: Cli, codex_linux_sandbox_exe: Option<PathBuf>) -> any
// Load configuration and determine approval policy
let overrides = ConfigOverrides {
model,
review_model: None,
config_profile,
// This CLI is intended to be headless and has no affordances for asking
// the user for approval.

View File

@@ -1248,6 +1248,7 @@ fn derive_config_from_params(
} = params;
let overrides = ConfigOverrides {
model,
review_model: None,
config_profile: profile,
cwd: cwd.map(PathBuf::from),
approval_policy,

View File

@@ -152,6 +152,7 @@ impl CodexToolCallParam {
// Build the `ConfigOverrides` recognized by codex-core.
let overrides = codex_core::config::ConfigOverrides {
model,
review_model: None,
config_profile: profile,
cwd: cwd.map(PathBuf::from),
approval_policy: approval_policy.map(Into::into),

View File

@@ -279,7 +279,9 @@ async fn run_codex_tool_session_inner(
| EventMsg::TurnAborted(_)
| EventMsg::ConversationPath(_)
| EventMsg::UserMessage(_)
| EventMsg::ShutdownComplete => {
| EventMsg::ShutdownComplete
| EventMsg::EnteredReviewMode(_)
| EventMsg::ExitedReviewMode(_) => {
// For now, we do not do anything extra for these
// events. Note that
// send(codex_event_to_notification(&event)) above has

View File

@@ -166,6 +166,10 @@ pub enum Op {
/// The agent will use its existing context (either conversation history or previous response id)
/// to generate a summary which will be returned as an AgentMessage event.
Compact,
/// Request a code review from the agent.
Review { review_request: ReviewRequest },
/// Request to shut down codex instance.
Shutdown,
}
@@ -504,6 +508,12 @@ pub enum EventMsg {
ShutdownComplete,
ConversationPath(ConversationPathResponseEvent),
/// Entered review mode.
EnteredReviewMode(ReviewRequest),
/// Exited review mode with an optional final result to apply.
ExitedReviewMode(Option<ReviewOutputEvent>),
}
// Individual event payload types matching each `EventMsg` variant.
@@ -962,6 +972,57 @@ pub struct GitInfo {
pub repository_url: Option<String>,
}
/// Review request sent to the review session.
#[derive(Debug, Clone, Deserialize, Serialize, PartialEq, TS)]
pub struct ReviewRequest {
pub prompt: String,
pub user_facing_hint: String,
}
/// Structured review result produced by a child review session.
#[derive(Debug, Clone, Deserialize, Serialize, PartialEq, TS)]
pub struct ReviewOutputEvent {
pub findings: Vec<ReviewFinding>,
pub overall_correctness: String,
pub overall_explanation: String,
pub overall_confidence_score: f32,
}
impl Default for ReviewOutputEvent {
fn default() -> Self {
Self {
findings: Vec::new(),
overall_correctness: String::default(),
overall_explanation: String::default(),
overall_confidence_score: 0.0,
}
}
}
/// A single review finding describing an observed issue or recommendation.
#[derive(Debug, Clone, Deserialize, Serialize, PartialEq, TS)]
pub struct ReviewFinding {
pub title: String,
pub body: String,
pub confidence_score: f32,
pub priority: i32,
pub code_location: ReviewCodeLocation,
}
/// Location of the code related to a review finding.
#[derive(Debug, Clone, Deserialize, Serialize, PartialEq, TS)]
pub struct ReviewCodeLocation {
pub absolute_file_path: PathBuf,
pub line_range: ReviewLineRange,
}
/// Inclusive line range in a file associated with the finding.
#[derive(Debug, Clone, Deserialize, Serialize, PartialEq, TS)]
pub struct ReviewLineRange {
pub start: u32,
pub end: u32,
}
#[derive(Debug, Clone, Deserialize, Serialize, TS)]
pub struct ExecCommandBeginEvent {
/// Identifier so this can be paired with the ExecCommandEnd event.

View File

@@ -1089,6 +1089,8 @@ impl ChatWidget {
self.app_event_tx
.send(crate::app_event::AppEvent::ConversationHistory(ev));
}
EventMsg::EnteredReviewMode(_) => {}
EventMsg::ExitedReviewMode(_) => {}
}
}

View File

@@ -122,6 +122,7 @@ pub async fn run_main(
let overrides = ConfigOverrides {
model,
review_model: None,
approval_policy,
sandbox_mode,
cwd,