How to Keep Codex Background Jobs Running Without Stale Terminals (bgmon)
Solve terminal drop and stale-session failures in Codex workflows using bgmon detached execution, durable logs, and explicit RUNNING/DEAD/STALE state checks.
Published: 2026-02-10 · Last updated: 2026-02-10
Estimated read time: 5 min
The failure mode in long Codex sessions is terminal fragility, not model quality. If a PTY disconnects, you need job state that survives the shell rather than shell history. bgmon provides this by persisting job metadata in ~/.bgmon/jobs/*.json and streaming stdout/stderr to durable logs in ~/.bgmon/logs/*.log.
In our live MolLEO pipeline, this pattern keeps the control chain deterministic. At one point the optimizer and follow-on stage were both active and inspectable from state, not from a still-open terminal:
NAME                                    STATUS   PID
c180_tuned_01_molleo_window_stage_01    RUNNING  530760
c180_tuned_01_window_to_deep04_post_01  RUNNING  567133
Launch is intentionally detached and session-safe: bgmon starts the job in a new process session, connects stdin to /dev/null, and routes stdout and stderr to the log file, so work continues across reconnects.
proc = subprocess.Popen(
    cmd,
    stdin=subprocess.DEVNULL,
    stdout=log_f,
    stderr=subprocess.STDOUT,
    cwd=args.cwd,
    start_new_session=True,
    close_fds=True,
)
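To show how the call above fits with the durable log and the job record, here is a minimal self-contained sketch. It is not bgmon's actual code: the function name launch_detached, the root argument, and the JSON field names are all assumptions made for illustration, and the demo writes to a throwaway temp directory rather than ~/.bgmon.

```python
import json
import subprocess
import sys
import tempfile
import time
from pathlib import Path


def launch_detached(name, cmd, root, cwd=None):
    """Start cmd in its own session; log under root/logs, record under root/jobs.

    Hypothetical helper: the directory layout mirrors the article, but the
    record fields below are illustrative, not bgmon's real schema.
    """
    root = Path(root)
    log_path = root / "logs" / f"{name}.log"
    job_path = root / "jobs" / f"{name}.json"
    log_path.parent.mkdir(parents=True, exist_ok=True)
    job_path.parent.mkdir(parents=True, exist_ok=True)

    # Append mode, so relaunching under the same name keeps earlier output.
    with open(log_path, "ab") as log_f:
        proc = subprocess.Popen(
            cmd,
            stdin=subprocess.DEVNULL,   # no terminal input to lose
            stdout=log_f,               # stdout goes to the log file
            stderr=subprocess.STDOUT,   # stderr is merged into the same log
            cwd=cwd,
            start_new_session=True,     # setsid(): detach from the launching terminal
            close_fds=True,
        )
    # The child inherited its own copy of the log descriptor, so ours can close.

    record = {
        "name": name,
        "pid": proc.pid,
        "cmd": list(cmd),
        "log": str(log_path),
        "started_at": time.time(),
    }
    # Write to a temp file and rename, so readers never see a half-written record.
    tmp = job_path.parent / (job_path.name + ".tmp")
    tmp.write_text(json.dumps(record, indent=2))
    tmp.replace(job_path)
    return record, proc


if __name__ == "__main__":
    demo_root = tempfile.mkdtemp(prefix="bgmon-sketch-")
    rec, child = launch_detached("demo", [sys.executable, "-c", "print('hello')"], demo_root)
    child.wait()
    print(Path(rec["log"]).read_text(), end="")
```

The write-then-rename step is a design choice in this sketch, not something the excerpt confirms: it keeps a status reader from parsing a partially written JSON file if it polls during launch.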
It also defends against PID reuse. On status and stop operations, bgmon re-reads the starttime field of /proc/<pid>/stat (the process start time, in clock ticks since boot) and compares it with the value recorded at launch. If the two differ, the PID has been recycled by another process, so bgmon marks the job STALE instead of signaling unrelated work.
if cur != expected_starttime:
    return (False, True)  # stale pid reuse
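The comparison above depends on reading the start time reliably. The following is a sketch of one way to do that on Linux, not bgmon's source: the function names are hypothetical, and the (alive, stale) meaning of the returned tuple is an assumption inferred from the two-line excerpt.

```python
def parse_starttime(stat_line):
    """Return field 22 (starttime, in clock ticks since boot) of a /proc/<pid>/stat line."""
    # Field 2 (comm) is wrapped in parentheses and may itself contain spaces
    # or ')' characters, so cut after the LAST ')' instead of splitting naively.
    rest = stat_line[stat_line.rindex(")") + 2:].split()
    # rest[0] is field 3 (state), so field 22 sits at index 22 - 3 = 19.
    return int(rest[19])


def read_starttime(pid):
    """Return the start time of pid, or None if no such process exists."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            return parse_starttime(f.read())
    except (FileNotFoundError, ProcessLookupError):
        return None


def check_pid(pid, expected_starttime):
    """Return (alive, stale); the tuple meaning is assumed from the excerpt."""
    cur = read_starttime(pid)
    if cur is None:
        return (False, False)  # process is gone: DEAD
    if cur != expected_starttime:
        return (False, True)   # pid reused by another process: STALE
    return (True, False)       # same process as at launch: RUNNING
```

A caller would record read_starttime(proc.pid) right after launch and pass it back as expected_starttime on every later status or stop call, signaling the PID only when the result is (True, False).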
With this control loop, workflows become queueable and composable: start stage A, watch its status, tail the logs, then trigger stage B from a post-hook. That is why bgmon is a must-have Codex operator tool for multi-hour and multi-day pipelines.
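The chaining step can be sketched as a small polling loop. This is an illustration under stated assumptions rather than bgmon's post-hook mechanism: get_status and post_hook are hypothetical callables supplied by the caller, and the sketch assumes that DEAD means the tracked process has exited while STALE means the recorded PID can no longer be trusted.

```python
import time


def wait_until_finished(get_status, name, poll_s=5.0, timeout_s=None, sleep=time.sleep):
    """Poll get_status(name) until it stops returning "RUNNING"; return the final status."""
    waited = 0.0
    while True:
        status = get_status(name)
        if status != "RUNNING":
            return status
        if timeout_s is not None and waited >= timeout_s:
            raise TimeoutError(f"{name} still RUNNING after {waited:.0f}s")
        sleep(poll_s)
        waited += poll_s


def chain(get_status, first, post_hook, **wait_kwargs):
    """Run post_hook once `first` is DEAD; skip it for STALE or any other status."""
    final = wait_until_finished(get_status, first, **wait_kwargs)
    if final == "DEAD":
        post_hook()  # e.g. launch stage B
        return True
    return False     # ambiguous state: leave the decision to the operator
```

Note that DEAD alone does not say whether stage A succeeded, so a real hook would likely also inspect the log or an exit marker before starting stage B.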
Scope boundary
This note covers: detached job control and workflow chaining for Codex and CLI pipelines.
This note excludes: strategy logic, crawler internals, and any execution policy beyond job lifecycle control.