Fifteen sections, ordered by how often you’re likely to hit each. Every section follows the same shape: Symptom → Diagnostic → Recovery. If a recovery step involves a command, it’s a play-button you can click.
Mobile access: see the PWA Terminal page (/terminal.html) — that is the supported mobile-control surface. Termius / Tailscale / mobile_dispatch.sh were removed in v1.0.
1. git clone fails during install with auth error
Symptom: installer step 3 halts with remote: Repository not found or Authentication failed for 'https://github.com/...'.
Diagnostic:
gh auth status
gh repo view KiwiMaddog2020/endenza
Recovery:
- Not logged in → gh auth login and pick HTTPS / GitHub.com / paste token or browser.
- Logged in as the wrong account → gh auth logout then gh auth login again.
- Repo is private and you’re not a collaborator → ping the user for an invite before re-running the installer.
2. Tools blocked with ORCHESTRATION LOCK: message
Symptom: MCP VM tool calls (computer-use, Claude_Preview, Claude_in_Chrome) fail with ORCHESTRATION LOCK: mode=direct or ORCHESTRATION LOCK: autopilot held by '<slug>'.
This is correct behavior. The hook is doing its job.
Diagnostic:
jq '.mode, .mode_transitioning, .active_autopilot_chat' ${CLAUDE_PLUGIN_DATA}/state.json
Recovery:
- mode=direct → say ***ORCHESTRATOR ON*** verbatim in a chat to flip back to orchestrated.
- mode_transitioning=true → shutdown cascade in progress; wait 30 s then re-check. If stuck, see §10.
- active_autopilot_chat != null and != you → another Agent holds the lock. Wait for ***AUTOPILOT COMPLETE*** or read the holder’s status file to see what they’re doing.
- Lock held by you but shouldn’t be → crash recovery. Force-release (see §4).
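For reference, the refusal order above can be approximated in a few jq reads. This is a sketch only, not the shipped orchestrator_lock.sh; check_lock, its parameters, and its messages are illustrative:

```shell
# Sketch: approximate the hook's refusal order. Not the shipped script.
check_lock() {
  STATE_FILE="$1"; MY_SLUG="$2"
  mode=$(jq -r '.mode' "$STATE_FILE")
  transitioning=$(jq -r '.mode_transitioning' "$STATE_FILE")
  holder=$(jq -r '.active_autopilot_chat // "null"' "$STATE_FILE")
  if [ "$mode" = "direct" ]; then
    echo "ORCHESTRATION LOCK: mode=direct"; return 2
  fi
  if [ "$transitioning" = "true" ]; then
    echo "ORCHESTRATION LOCK: transition in progress"; return 2
  fi
  if [ "$holder" != "null" ] && [ "$holder" != "$MY_SLUG" ]; then
    echo "ORCHESTRATION LOCK: autopilot held by '$holder'"; return 2
  fi
  echo "allowed"; return 0
}

# Demo against a throwaway state file
tmp=$(mktemp)
echo '{"mode":"orchestrated","mode_transitioning":false,"active_autopilot_chat":"other-chat"}' > "$tmp"
result=$(check_lock "$tmp" "my-chat")
rm -f "$tmp"
```

The ordering matters: mode is checked before the holder, which is why a direct-mode refusal never names a slug.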
3. [ORCHESTRATOR] ... MISSING at SessionStart
Symptom: new chat’s context shows [ORCHESTRATOR] hard-enforcement hook MISSING at ${CLAUDE_PLUGIN_ROOT}/bin/orchestrator_lock.sh or ... INSTALLED but not registered in ....
Diagnostic:
ls -la ${CLAUDE_PLUGIN_ROOT}/bin/orchestrator_lock.sh
jq '.hooks.PreToolUse' ~/.claude/settings.json
Recovery:
- Script missing → re-run the installer or git pull in ${CLAUDE_PLUGIN_ROOT} and chmod +x bin/*.sh.
- Script present, not in settings → copy templates/CLAUDE_CODE_SETTINGS.example.json contents into ~/.claude/settings.json, merging any existing hooks you want to keep.
- Script present, registered, but healthcheck still says MISSING → bug. Paste the healthcheck output into this chat for diagnosis.
4. Autopilot lock stuck
Symptom: jq '.active_autopilot_chat' ${CLAUDE_PLUGIN_DATA}/state.json shows a slug, but the chat that held it is closed or unresponsive for > 45 min.
Diagnostic:
cat ${CLAUDE_PLUGIN_DATA}/state.lock.d/holder.json 2>/dev/null
stat -f '%Sm' ${CLAUDE_PLUGIN_DATA}/state.lock.d 2>/dev/null
Recovery:
- Wait up to 15 min → the stale_lock_sweeper.sh launchd job (if loaded) auto-releases locks older than 45 min. Tail /tmp/orchestrator-sweeper.out to watch.
- Force-release immediately:
rm -rf ${CLAUDE_PLUGIN_DATA}/state.lock.d && jq '.active_autopilot_chat=null | .active_vm=null | .lock_acquired_at=null' ${CLAUDE_PLUGIN_DATA}/state.json > /tmp/s.json && mv /tmp/s.json ${CLAUDE_PLUGIN_DATA}/state.json && echo "lock force-released"
- Log the manual release so the audit trail is complete:
echo "$(date -u +%FT%TZ) MANUAL_STEAL by kevin — stuck chat recovery" >> ${CLAUDE_PLUGIN_DATA}/lock_steals.log
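If you do this often, the force-release plus audit-log steps can live in one helper. A hedged sketch with the data directory passed in; force_release is an illustrative name, not a shipped script:

```shell
# Sketch: wrap the manual force-release + audit-log steps above.
# DATA_DIR stands in for ${CLAUDE_PLUGIN_DATA}; adjust to your install.
force_release() {
  DATA_DIR="$1"; WHO="$2"
  rm -rf "$DATA_DIR/state.lock.d"
  jq '.active_autopilot_chat=null | .active_vm=null | .lock_acquired_at=null' \
    "$DATA_DIR/state.json" > "$DATA_DIR/state.json.tmp" \
    && mv "$DATA_DIR/state.json.tmp" "$DATA_DIR/state.json"
  echo "$(date -u +%FT%TZ) MANUAL_STEAL by $WHO — stuck chat recovery" \
    >> "$DATA_DIR/lock_steals.log"
  echo "lock force-released"
}

# Demo in a throwaway directory
demo=$(mktemp -d)
mkdir -p "$demo/state.lock.d"
echo '{"active_autopilot_chat":"stuck-chat","active_vm":"vm1","lock_acquired_at":"x"}' > "$demo/state.json"
force_release "$demo" "kevin"
released_holder=$(jq -r '.active_autopilot_chat' "$demo/state.json")
lock_gone=$( [ -d "$demo/state.lock.d" ] && echo no || echo yes )
logged=$(grep -c "MANUAL_STEAL" "$demo/lock_steals.log")
rm -rf "$demo"
```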
5. settings.json merge conflict during install
Symptom: installer step 5 halts with “existing PreToolUse matcher conflicts with ours” or jq parse error.
Diagnostic:
jq '.hooks.PreToolUse[].matcher' ~/.claude/settings.json
Recovery:
- Existing identical matcher → your previous hook is already there. Safe to skip step 5.
- Existing different command on the same matcher → rename ours to use a unique matcher, or merge the two commands into a single shell script. Ping me with the diff and I’ll propose a merge.
- Invalid JSON → back up + start fresh:
mv ~/.claude/settings.json ~/.claude/settings.json.broken-$(date +%s) && cp ${CLAUDE_PLUGIN_ROOT}/templates/CLAUDE_CODE_SETTINGS.example.json ~/.claude/settings.json
6. Scheduled routine didn’t fire overnight
Symptom: expected briefing at ${CLAUDE_PLUGIN_DATA}/briefings/YYYY-MM-DD.md but file doesn’t exist.
Diagnostic:
# Did the Mac sleep past the fire time?
pmset -g log | grep -i "wake\|sleep" | tail -20
# Is the Desktop app running?
pgrep -f "Claude.app" > /dev/null && echo "app running" || echo "app NOT running"
# What's scheduled?
ls -la ~/.claude/scheduled-tasks/
Recovery:
- Mac was asleep → Desktop scheduled tasks only fire when the Mac is awake + the app is running. Enable “Keep computer awake” in Desktop app settings, and don’t close the lid overnight.
- App wasn’t running → open Claude Desktop; the 7-day catch-up may replay the missed fire. Otherwise the fire is lost; wait for tomorrow’s.
- Mode was direct at the scheduled time → routine no-oped silently by design. Flip mode back with ***ORCHESTRATOR ON***.
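The three failure causes above can be pre-flighted the night before. A minimal sketch, assuming the state path is passed in; can_fire is an illustrative name, not a shipped helper:

```shell
# Sketch: pre-flight for an overnight fire. STATE_FILE is parameterized
# so the mode check is testable; the app/awake checks are interactive.
can_fire() {
  STATE_FILE="$1"
  mode=$(jq -r '.mode' "$STATE_FILE")
  if [ "$mode" != "orchestrated" ]; then
    echo "would no-op: mode=$mode"
    return 1
  fi
  # In a real run you'd also confirm (per the diagnostics above):
  #   pgrep -f "Claude.app"   → app running
  #   pmset -g log            → Mac stayed awake past the fire time
  echo "mode ok"
}

tmp=$(mktemp)
echo '{"mode":"direct"}' > "$tmp"
fire_check=$(can_fire "$tmp")
rm -f "$tmp"
```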
7. state.json is corrupt or missing
Symptom: jq . ${CLAUDE_PLUGIN_DATA}/state.json fails; hook healthcheck reports state unreadable; tools fail-closed.
Diagnostic:
ls -la ${CLAUDE_PLUGIN_DATA}/state.json*
cat ${CLAUDE_PLUGIN_DATA}/state.json
Recovery:
- Restore from latest backup snapshot (if backup_snapshot.sh has been running):
ls -la ${CLAUDE_PLUGIN_ROOT}/backups/ | tail -5
cp ${CLAUDE_PLUGIN_ROOT}/backups/<most-recent>/state.json ${CLAUDE_PLUGIN_DATA}/state.json
jq . ${CLAUDE_PLUGIN_DATA}/state.json && echo "restored"
- Or re-initialize from schema:
# unquoted heredoc delimiter so the $(date ...) in "notes" actually expands
cat > ${CLAUDE_PLUGIN_DATA}/state.json <<EOF
{"schema_version":1,"mode":"orchestrated","mode_transitioning":false,"last_mode_change":null,"last_mode_change_reason":null,"active_autopilot_chat":null,"active_vm":null,"lock_acquired_at":null,"current_task":null,"orchestrator_automations":[],"automation_registry":[],"cascade":null,"notes":"Re-initialized $(date -u +%FT%TZ) after corruption."}
EOF
- Then re-register any automation_registry entries and run bin/status.sh to confirm.
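Whichever path you took, a parse-plus-required-keys check confirms the restore before tools retry. A sketch; the key list is inferred from the fields this guide reads elsewhere, not from a published schema:

```shell
# Sketch: verify state.json parses and carries the keys the hook reads.
validate_state() {
  f="$1"
  jq -e . "$f" > /dev/null 2>&1 || { echo "INVALID: not JSON"; return 1; }
  for key in mode mode_transitioning active_autopilot_chat; do
    jq -e "has(\"$key\")" "$f" > /dev/null || { echo "INVALID: missing $key"; return 1; }
  done
  echo "OK"
}

tmp=$(mktemp)
echo '{"schema_version":1,"mode":"orchestrated","mode_transitioning":false,"active_autopilot_chat":null}' > "$tmp"
state_check=$(validate_state "$tmp")
rm -f "$tmp"
```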
8. Launchd sweeper not firing
Symptom: locks older than 45 min sit un-stolen; lock_steals.log has no recent entries.
Diagnostic:
launchctl list | grep orchestrator
ls -la /tmp/orchestrator-sweeper.out /tmp/orchestrator-sweeper.err
tail -20 /tmp/orchestrator-sweeper.err 2>/dev/null
Recovery:
- Not loaded → load it:
launchctl load ${CLAUDE_PLUGIN_ROOT}/launchd/com.kevin.orchestrator.stale-lock-sweeper.plist
- Loaded but erroring → check stderr; common causes: python3 missing, script not executable (chmod +x ${CLAUDE_PLUGIN_ROOT}/bin/stale_lock_sweeper.sh), or the plist pointing at a stale path after a git pull moved files.
- Reload after edits:
launchctl unload ${CLAUDE_PLUGIN_ROOT}/launchd/com.kevin.orchestrator.stale-lock-sweeper.plist 2>/dev/null
launchctl load ${CLAUDE_PLUGIN_ROOT}/launchd/com.kevin.orchestrator.stale-lock-sweeper.plist
9. Permission prompt blocking automated work
Symptom: a scheduled routine or Channels-session request stalls on a Claude Code permission dialog that nobody’s there to click.
Diagnostic: check the session in claude.ai/code scheduled tasks sidebar — running tasks that pause for permission show a pending-approval state.
Recovery:
- Click “Run now” once in the Scheduled sidebar with you at the keyboard — approvals from that run are saved to the task and auto-applied to future fires.
- Or add the command to the starter allow-list by editing ~/.claude/settings.json:
{"permissions": {"allow": ["Bash(your-command *)"]}}
- For Channels sessions, start them with --dangerously-skip-permissions — the PreToolUse hook still blocks VM tools (hooks run before the permission check), so bypassing prompts does not bypass safety.
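The allow-list edit is safer done with jq than by hand, since it preserves whatever permissions already exist. A sketch; add_allow is an illustrative helper name, and `unique` makes repeat runs idempotent:

```shell
# Sketch: append an allow-list entry to settings.json without clobbering
# entries already there. Run against a copy first.
add_allow() {
  f="$1"; entry="$2"
  jq --arg e "$entry" '.permissions.allow = ((.permissions.allow // []) + [$e] | unique)' \
    "$f" > "$f.tmp" && mv "$f.tmp" "$f"
}

tmp=$(mktemp)
echo '{"permissions":{"allow":["Bash(git status)"]}}' > "$tmp"
add_allow "$tmp" 'Bash(your-command *)'
add_allow "$tmp" 'Bash(your-command *)'   # second run is a no-op via unique
allow_count=$(jq '.permissions.allow | length' "$tmp")
rm -f "$tmp"
```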
10. Shutdown cascade stuck (mode_transitioning=true forever)
Symptom: state.json.mode_transitioning shows true for > 1 minute; chats print “⏸ ORCHESTRATOR IN TRANSITION — standing by.” on every turn.
Diagnostic:
jq '.cascade' ${CLAUDE_PLUGIN_DATA}/state.json
Recovery:
- Check cascade.phase and cascade.executor_heartbeat ages.
- If phase != "done" and heartbeat > 30 s old, any chat can resume the cascade (Track D §3 takeover). Say ***ORCHESTRATOR OFF*** verbatim in any chat.
- Force-reset (last resort):
jq '.mode="direct" | .mode_transitioning=false | .cascade=null | .active_autopilot_chat=null | .active_vm=null | .lock_acquired_at=null' ${CLAUDE_PLUGIN_DATA}/state.json > /tmp/s.json && mv /tmp/s.json ${CLAUDE_PLUGIN_DATA}/state.json
rm -rf ${CLAUDE_PLUGIN_DATA}/state.lock.d
echo "$(date -u +%FT%TZ) MANUAL_CASCADE_RESET by kevin" >> ${CLAUDE_PLUGIN_DATA}/lock_steals.log
- Then say ***ORCHESTRATOR ON*** in a fresh chat to resume normal operation.
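The heartbeat-age comparison in the takeover rule can be computed instead of eyeballed. A sketch that assumes executor_heartbeat is an ISO-8601 UTC timestamp, like the other timestamps in this guide; heartbeat_age is an illustrative name:

```shell
# Sketch: compute how stale the cascade executor's heartbeat is.
# Assumes .cascade.executor_heartbeat looks like 2026-04-26T10:00:00Z.
heartbeat_age() {
  STATE_FILE="$1"
  hb=$(jq -r '.cascade.executor_heartbeat // empty' "$STATE_FILE")
  [ -n "$hb" ] || { echo "no cascade heartbeat"; return 1; }
  now=$(date -u +%s)
  hb_epoch=$(jq -rn --arg t "$hb" '$t | fromdateiso8601')
  echo $(( now - hb_epoch ))
}

tmp=$(mktemp)
jq -n --arg t "$(date -u +%FT%TZ)" '{cascade:{phase:"vm_shutdown",executor_heartbeat:$t}}' > "$tmp"
age=$(heartbeat_age "$tmp")   # > 30 means any chat may take over
rm -f "$tmp"
```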
11. Chat slug mismatch — hook can’t identify the chat
Symptom: ORCHESTRATION LOCK: autopilot held by 'other-slug'. This chat ('unknown') must queue. when you expect the chat to BE the holder.
Diagnostic:
cat "$PWD/.claude/.chat_slug" 2>/dev/null
jq '.active_autopilot_chat' ${CLAUDE_PLUGIN_DATA}/state.json
Recovery:
- .chat_slug missing → write it:
mkdir -p "$PWD/.claude" && echo "<your-slug>" > "$PWD/.claude/.chat_slug"
- Slug written but cwd at tool-call time is a different directory (e.g. a subagent working elsewhere) → move the slug file up to the repo root, or use an absolute $CLAUDE_PROJECT_DIR-based path in your hook.
- Slug mismatch vs state.json → update state.json.active_autopilot_chat manually if you own the lock:
jq '.active_autopilot_chat="<your-slug>"' ${CLAUDE_PLUGIN_DATA}/state.json > /tmp/s.json && mv /tmp/s.json ${CLAUDE_PLUGIN_DATA}/state.json
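For intuition, slug resolution presumably reduces to a file read under the tool call's cwd, which is why a wrong cwd yields 'unknown'. A sketch of that behavior, not the shipped hook's actual lookup:

```shell
# Sketch: resolve the chat slug as §11 describes it — read
# .claude/.chat_slug under the tool call's cwd, else report unknown.
resolve_slug() {
  dir="$1"
  if [ -f "$dir/.claude/.chat_slug" ]; then
    cat "$dir/.claude/.chat_slug"
  else
    echo "unknown"
  fi
}

demo=$(mktemp -d)
slug_before=$(resolve_slug "$demo")
mkdir -p "$demo/.claude" && echo "my-slug" > "$demo/.claude/.chat_slug"
slug_after=$(resolve_slug "$demo")
rm -rf "$demo"
```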
12. Hook smoke test fails during install
Symptom: installer step 7 reports ✗ Hook did not block. Exit code: 0. when mode was flipped to direct.
Diagnostic:
echo '{"tool_name":"mcp__computer-use__screenshot","cwd":"/tmp"}' | ${CLAUDE_PLUGIN_ROOT}/bin/orchestrator_lock.sh
echo "exit=$?"
which jq
test -x ${CLAUDE_PLUGIN_ROOT}/bin/orchestrator_lock.sh && echo "executable" || echo "NOT executable"
Recovery:
- jq missing → brew install jq, re-run test.
- Not executable → chmod +x ${CLAUDE_PLUGIN_ROOT}/bin/*.sh.
- Script reads $HOME from somewhere unexpected → set an explicit ORCH=${CLAUDE_PLUGIN_ROOT} at the top (already done in the shipped script).
- Still failing → set -x at the top of the hook, re-run, paste the trace.
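For comparison while debugging, this is the fail-closed shape the smoke test expects: read tool-call JSON on stdin, exit non-zero when blocked. A stripped-down skeleton, not the shipped script; mini_hook and its messages are illustrative:

```shell
# Sketch: minimal fail-closed PreToolUse hook skeleton.
# Reads {"tool_name":...} on stdin; blocks VM tools when mode=direct.
mini_hook() {
  STATE_FILE="$1"
  input=$(cat)
  command -v jq > /dev/null || { echo "fail-closed: jq missing"; return 2; }
  tool=$(printf '%s' "$input" | jq -r '.tool_name // empty') \
    || { echo "fail-closed: bad input"; return 2; }
  mode=$(jq -r '.mode // empty' "$STATE_FILE" 2>/dev/null)
  [ -n "$mode" ] || { echo "fail-closed: state unreadable"; return 2; }
  case "$tool" in
    mcp__computer-use__*)
      if [ "$mode" = "direct" ]; then
        echo "ORCHESTRATION LOCK: mode=direct"
        return 2
      fi ;;
  esac
  return 0
}

tmp=$(mktemp)
echo '{"mode":"direct"}' > "$tmp"
echo '{"tool_name":"mcp__computer-use__screenshot","cwd":"/tmp"}' | mini_hook "$tmp"
hook_exit=$?   # the installer's smoke test wants non-zero here
rm -f "$tmp"
```

Note the fail-closed default: missing jq or an unreadable state file also blocks, which matches §7's "tools fail-closed" symptom.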
13. gh auth surprises
Symptom: gh commands unexpectedly fail, the auto-GitHub extension can’t create repos, or git push prompts for a password.
Diagnostic:
gh auth status
git config --global credential.helper
Recovery:
- Multiple accounts → gh auth switch to the right one.
- Token expired → gh auth refresh.
- HTTPS vs SSH mismatch → check git remote -v and gh auth setup-git.
- Two-factor popping repeatedly → use a personal access token scoped to repo + workflow instead of browser auth.
14. Multi-machine sync surprises
Symptom: new Mac doesn’t have the same state as the old one; charters in the new clone are stale.
Recovery:
- Re-run installer on the new Mac.
- Clone your personal <gh-user>/my-ensemble-config repo to ${CLAUDE_PLUGIN_ROOT}-config/ (if you opted into the auto-GitHub extension):
git clone https://github.com/<you>/my-ensemble-config.git ${CLAUDE_PLUGIN_ROOT}-config
- Run bin/sync-config.sh to apply your work_hours.json + allow-list delta from the config repo into the canonical paths. state.json, chats/, and state.lock.d/ are runtime-local and do not sync across machines by design.
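If you want to sanity-check what the sync will touch before running it, this is the rough shape such a step takes. Illustrative only; sync_config and its file list are assumptions, so verify paths against the shipped bin/sync-config.sh:

```shell
# Sketch: copy synced config files from the config clone into canonical
# paths, refusing to copy the runtime-local files named above.
sync_config() {
  SRC="$1"; DEST="$2"
  for f in work_hours.json; do
    [ -f "$SRC/$f" ] && cp "$SRC/$f" "$DEST/$f"
  done
  # Never copy runtime-local state across machines:
  for f in state.json state.lock.d; do
    [ -e "$SRC/$f" ] && echo "skipping runtime-local: $f"
  done
}

src=$(mktemp -d); dest=$(mktemp -d)
echo '{"start":"09:00"}' > "$src/work_hours.json"
echo '{}' > "$src/state.json"    # should never reach $dest
sync_config "$src" "$dest"
synced=$( [ -f "$dest/work_hours.json" ] && echo yes || echo no )
leaked=$( [ -f "$dest/state.json" ] && echo yes || echo no )
rm -rf "$src" "$dest"
```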
15. Channels session dropped
Note: The exact Claude Code invocation for an iMessage Channels listener is plugin-documented, not a core claude CLI flag. The --channels references in earlier drafts were speculative. Before relying on this recovery path, verify the current command via /plugin marketplace and the installed iMessage plugin’s own documentation. Path B (Cloud Routine + iMessage Channels) is still v2-scope.
Symptom: iMessage commands to kill the Ensemble or trigger a routine don’t produce replies. The persistent tmux session that should be listening isn’t.
Diagnostic:
tmux ls 2>/dev/null # list tmux sessions
pgrep -fl claude | head # look for a running claude process (exact match string depends on install)
Recovery:
- Not running → restart the session. Placeholder shape pending plugin-doc verification:
# Exact invocation TBD per installed iMessage plugin's docs.
cd ${CLAUDE_PLUGIN_ROOT} && tmux new-session -d -s orchestrator 'claude <channel-flags-per-plugin> --dangerously-skip-permissions'
- Frequent drops → enable launchd KeepAlive (ship a com.kevin.orchestrator.channels.plist that respawns the tmux session on crash).
- Full Disk Access revoked → System Settings → Privacy & Security → Full Disk Access → add Terminal/iTerm back.
v2.0 failure modes (2026-04-26 contract sweep)
16. Per-project spawn lock conflict (refusal code 2)
Symptom: Autopilot fires fail with SPAWN-LOCK: per-project lock held for '<slug>' (pid=N). Refused.
Cause: Another autopilot is already targeting the same project. The per-project lock prevents two subprocesses from stomping on the same repo simultaneously.
Fix:
# See what's holding it
bash plugins/ensemble/bin/orchestrator_spawn_lock.sh status
# If the holder PID is dead but lock dir lingers, sweep stale
bash plugins/ensemble/bin/orchestrator_spawn_lock.sh sweep
# Or force release a specific slug (only if you're SURE no autopilot is running)
bash plugins/ensemble/bin/orchestrator_spawn_lock.sh release <slug>
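A per-project lock of this kind is typically built on mkdir atomicity, which is why a dead holder leaves a lingering lock dir to sweep. A simplified sketch of acquire/release; the shipped orchestrator_spawn_lock.sh is the source of truth:

```shell
# Sketch: per-project spawn lock via atomic mkdir. Simplified.
acquire() {
  LOCK_ROOT="$1"; slug="$2"
  if mkdir "$LOCK_ROOT/$slug.lock.d" 2>/dev/null; then
    echo "$$" > "$LOCK_ROOT/$slug.lock.d/pid"   # record holder for `status`
    echo "acquired"
  else
    holder=$(cat "$LOCK_ROOT/$slug.lock.d/pid" 2>/dev/null)
    echo "SPAWN-LOCK: per-project lock held for '$slug' (pid=$holder). Refused."
    return 2
  fi
}
release() { rm -rf "$1/$2.lock.d"; }

root=$(mktemp -d)
first=$(acquire "$root" myproj)    # free, so acquired
second=$(acquire "$root" myproj)   # held, so refused
release "$root" myproj
third=$(acquire "$root" myproj)    # acquired again after release
rm -rf "$root"
```

mkdir either creates the directory or fails, with no in-between, so two simultaneous spawns can never both win.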
17. Concurrency cap hit + queue timeout (refusal code 3)
Symptom: SPAWN-LOCK: cap held >300s, no slot opened for '<slug>'. Refused.
Cause: state.json.max_concurrent_autopilots is at capacity, no slot opened within the 5-minute queue timeout.
Fix: Raise cap or kill an existing autopilot:
jq '.max_concurrent_autopilots = 3' state.json > s.tmp && mv s.tmp state.json
# or kill all running autopilots from the PWA Terminal page (kill switch)
# or manually:
pkill -f autopilot_session.sh
18. Resource floor refused (refusal code 4)
Symptom: SPAWN-LOCK: resource floor — RAM N% > threshold M%. Refused. (or CPU variant)
Cause: state.json.resource_floor_enabled = true and the system is taxed at spawn time.
Fix: Wait or relax thresholds:
jq '.resource_floor_thresholds.ram_pct = 95' state.json > s.tmp && mv s.tmp state.json
# or disable
jq '.resource_floor_enabled = false' state.json > s.tmp && mv s.tmp state.json
19. Mid-run resource watchdog killed an autopilot
Symptom: Autopilot session aborts mid-run; runs/watchdog-<date>.log shows KILLING parent process group (sustained pressure).
Cause: state.json.resource_watchdog_enabled = true and the watchdog detected sustained pressure (default: 3 consecutive breaches at 95% RAM or load > 6.0).
Fix: Accept and reduce parallelism, or relax thresholds:
jq '.resource_watchdog_enabled = false' state.json > s.tmp && mv s.tmp state.json
# or env-tune
WATCHDOG_RAM_PCT_THRESHOLD=98 WATCHDOG_BREACHES_TO_KILL=5 bash bin/autopilot_session.sh 60
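The sustained-pressure rule kills only after N consecutive breaches, so a single spike survives. A sketch with simulated readings standing in for real RAM samples; watchdog here is illustrative, not the shipped loop:

```shell
# Sketch: kill only after N *consecutive* breaches, so one spike survives.
watchdog() {
  threshold="$1"; breaches_to_kill="$2"; shift 2
  count=0
  for reading in "$@"; do
    if [ "$reading" -ge "$threshold" ]; then
      count=$((count + 1))
      [ "$count" -ge "$breaches_to_kill" ] && { echo "KILL at reading $reading"; return 1; }
    else
      count=0    # any dip below threshold resets the streak
    fi
  done
  echo "survived"
}

spike=$(watchdog 95 3 90 96 90 96 90)   # isolated spikes reset the counter
sustained=$(watchdog 95 3 96 97 98)     # three in a row triggers the kill
```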
20. Network pre-flight failure
Symptom: [autopilot] ABORT: network pre-flight failed (cannot reach api.anthropic.com).
Fix: Check connectivity. If offline by design, disable:
jq '.require_network_check = false' state.json > s.tmp && mv s.tmp state.json
21. Claude CLI version too old
Symptom: [autopilot] ABORT: claude CLI X.Y.Z < required A.B.C.
Fix:
npm i -g @anthropic-ai/claude-code
# or relax the pin
jq '.min_claude_cli_version = ""' state.json > s.tmp && mv s.tmp state.json
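The version pin reduces to a field-by-field numeric compare (plain string comparison would rank 1.9 above 1.10). A portable sketch; the autopilot script's own check may differ:

```shell
# Sketch: compare dotted versions numerically, field by field.
version_ge() {
  v="$1"; r="$2"
  i=1
  while [ "$i" -le 3 ]; do
    va=$(printf '%s' "$v" | cut -d. -f"$i")
    vb=$(printf '%s' "$r" | cut -d. -f"$i")
    va=${va:-0}; vb=${vb:-0}
    [ "$va" -gt "$vb" ] && return 0
    [ "$va" -lt "$vb" ] && return 1
    i=$((i + 1))
  done
  return 0   # all three fields equal
}

version_ge "1.10.0" "1.9.3" && v1=pass || v1=fail   # 10 > 9 numerically
version_ge "0.9.0" "1.0.0" && v2=pass || v2=fail
```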
22. Project state is Hibernating
Symptom: [autopilot] ABORT: project '<slug>' is Hibernating; promote first.
Fix: Move it out of Hibernating before firing:
# Verbal in any chat:
"set <slug> to Building" # or Warmer / R&D / Updates / Launch Prep
# Or directly:
jq '.state = "Warmer"' chats/<slug>.json > s.tmp && mv s.tmp chats/<slug>.json
23. State.json corruption recovery
Symptom: state.json won’t parse; autopilot/hooks fail closed.
Recovery path (in order):
- Check for atomic-rename leftover — every state-write uses .tmp + mv. If a write was interrupted, look for state.json.tmp:
ls state.json* | head
jq . state.json.tmp && mv state.json.tmp state.json
- Restore from backup_snapshot — bin/backup_snapshot.sh runs nightly:
ls backups/state.json.*
cp backups/state.json.YYYYMMDD-HHMMSS state.json
- Hand-rebuild from schema — if no backup, create minimum viable:
cat > state.json <<'EOF'
{
  "schema_version": 1,
  "mode": "orchestrated",
  "mode_transitioning": false,
  "active_autopilot_chat": null,
  "active_vm": null,
  "lock_acquired_at": null,
  "lock_intent": null,
  "max_concurrent_autopilots": 2,
  "active_autopilots": {},
  "resource_floor_enabled": false,
  "resource_floor_thresholds": {"ram_pct": 90, "cpu_load": 4.0},
  "spawn_queue": [],
  "parallel_code_allowed": true,
  "rapid_fire_enabled": true,
  "caffeinate_during_autopilot": true,
  "require_ac_power": false,
  "min_free_disk_gb": 5,
  "require_network_check": true,
  "min_claude_cli_version": "",
  "resource_watchdog_enabled": false,
  "cascade": null
}
EOF
- Re-heartbeat all chats — after recovery, run a Maestro session that bumps each chats/*.json.last_heartbeat so the dashboard reflects current state.
If you hit something not in this list, grab bin/status.sh output and a one-line symptom and ping the Maestro. The failure catalog grows from real incidents.