Lifting drill into Superpowers `evals/`: Design

Translated as-is from Markdown and converted to Astro Starlight MDX format.

Background

Drill is a Python skill-compliance benchmark that lives in its own repo at obra/drill. It drives real tmux sessions, runs an LLM actor as a simulated user, runs an LLM verifier on the resulting transcript, and reports pass/fail per scenario. It supports Claude Code, Codex, Gemini CLI, and (per recent commits) OpenCode and Copilot CLI.

Drill is already the de facto eval harness for superpowers. The PRI-1397 commit series in the drill repo lifted ~22 superpowers bash tests into drill scenarios, and the most recent superpowers commit (a2292c5) explicitly removed a redundant bash test with the message “replaced by drill behavioral coverage”. Migration momentum exists; this spec completes it.

This work moves drill into superpowers under evals/, deletes the redundant bash tests after per-file verification of drill scenario coverage, and updates docs so contributors land on the new structure.

Goals

evals/ is the canonical eval harness in superpowers — full drill source, scenarios, fixtures, prompts, backend configs, and tests.
Bash tests in superpowers/tests/ that have been individually verified as 100% covered by drill scenarios are deleted; the rest are preserved.
The split between tests/ (plugin infrastructure: bash + node + python integration tests) and evals/ (LLM behavior with actor + verifier) is meaningful and documented.
Top-level docs (README.md, CLAUDE.md, docs/testing.md) point contributors at the right place.
The standalone obra/drill repo continues to exist (this PR does not touch it) and gets archived as a separate manual step after this PR merges.

Non-goals

CI integration. Manual-only here. The natural follow-up is “tiered”: fast subset on every PR, full sweep nightly + on-demand. That requires API budget decisions, GitHub Actions secrets, and a runner image with tmux + node + python + claude / codex / gemini CLIs installed. Out of scope.
Scenario co-location with skills. Scenarios stay centralized at evals/scenarios/. If we later decide each skill should own its scenarios, that’s a path-find-and-rename operation; the YAML format does not change.
Renaming the internal Python package (drill → evals). The directory is evals/ (user-facing); the Python package keeps its drill name to keep the diff small. A short note in evals/README.md explains.
Drill repo archival. This PR does not touch obra/drill. After merge, the drill repo is archived manually (read-only on GitHub, README pointer to obra/superpowers/evals/).
Lifting tests/claude-code/analyze-token-usage.py into evals/bin/. Useful utility, not test code. Can move later; not required by this PR.

Branching

Branch off dev as f/evals-lift. This work is independent of the open f/cross-platform PR — no shared file changes besides possibly README.md, which is small enough to resolve at merge time if it conflicts.

Architecture after the move

superpowers/
  evals/                              ← NEW (full drill copy)
    pyproject.toml                    (Python 3.11, uv-managed)
    uv.lock
    .gitignore                        (drill's own; results/, .venv/, .env)
    README.md                         (was drill's README; install instructions updated)
    CLAUDE.md                         (was drill's CLAUDE.md; paths updated)
    docs/
      design.md                       (drill's design — preserved verbatim, cross-linked from this spec)
      manual-testing.md
      pressure-and-red-testing.md
    drill/                            (Python package; name kept; cli, engine, actor, verifier, etc.)
    backends/                         (claude-*.yaml, codex.yaml, gemini.yaml)
    scenarios/                        (32+ YAML scenarios)
    setup_helpers/                    (15 Python helpers; create_base_repo, sdd_*, spec_*, worktree, etc.)
    fixtures/                         (template-repo, sdd-go-fractals, sdd-svelte-todo)
    prompts/                          (actor.md, verifier.md)
    bin/                              (assertion helper scripts: tool-called, tool-count, etc.)
    tests/                            (drill's own pytest suite)

  tests/                              ← bash tests preserved by default
    brainstorm-server/                ← KEEP (node tests for brainstorm-server JS code)
    opencode/                         ← KEEP (plugin loading tests)
    codex-plugin-sync/                ← KEEP (sync verification)
    claude-code/                      ← MOSTLY KEEP — see deletion gate
    explicit-skill-requests/          ← KEEP unless verified replaced
    skill-triggering/                 ← KEEP unless verified replaced
    subagent-driven-dev/              ← KEEP unless verified replaced

  docs/
    testing.md                        ← UPDATED (split into "Plugin tests" + "Skill behavior evals")
    superpowers/
      specs/
        2026-05-06-lift-drill-into-evals-design.md   ← THIS SPEC

  README.md                           ← small Contributing-section pointer to evals/
  CLAUDE.md                           ← one-line "Eval harness lives at evals/" pointer

The tests/ and evals/ directories serve clearly distinct roles after this PR:

tests/ — does the plugin’s non-LLM code work? Unit and integration tests for the brainstorm-server JS code, OpenCode plugin loading, codex-plugin-sync sync verification. Bash + node + python.
evals/ — do agents behave correctly on real LLM sessions? Drill scenarios with actor + verifier. Python-only, runs real tmux sessions.

Deletion gate (per bash test)

A bash test is deleted only if a drill scenario verifiably covers every assertion it makes. The implementation plan documents this verification per file: read the bash test, list its checks, find the drill scenario, confirm each check has a matching verify.assertions or verify.criteria entry. If even one check is missing, either extend the drill scenario or keep the bash test; the default is to keep it.

Tentative coverage map (commit-message-based; needs per-file verification before any deletion):

Bash test	Claimed drill replacement	Coverage status
`tests/skill-triggering/prompts/*` (6 prompt files)	`triggering-*.yaml` (6 scenarios)	candidate — verify per-prompt before deleting
`tests/skill-triggering/run-test.sh`, `run-all.sh`	n/a (runners, not tests)	keep — runner scripts
`tests/explicit-skill-requests/prompts/please-use-brainstorming.txt`	needs verification — drill has no obvious counterpart yet	likely keep unless drill scenario added
`tests/explicit-skill-requests/prompts/use-systematic-debugging.txt`	needs verification — drill has no obvious counterpart	likely keep unless drill scenario added
`tests/explicit-skill-requests/run-claude-describes-sdd.sh`	partially → `mid-conversation-skill-invocation.yaml`	candidate — verify per-script
`tests/explicit-skill-requests/run-haiku-test.sh`	no drill scenario covers Haiku-specific behavior	keep
`tests/explicit-skill-requests/run-multiturn-test.sh`, `run-extended-multiturn-test.sh`	no drill scenario covers multi-turn build-up	keep unless drill scenarios added
`tests/explicit-skill-requests/run-test.sh`, `run-all.sh`	n/a (runners)	keep
`tests/subagent-driven-dev/go-fractals/`, `tests/subagent-driven-dev/svelte-todo/`	`sdd-go-fractals.yaml`, `sdd-svelte-todo.yaml`	candidate — verify before deleting (these include real assertions about test suites passing)
`tests/claude-code/test-document-review-system.sh`	`spec-reviewer-catches-planted-flaws.yaml`	candidate — verify before deleting
`tests/claude-code/test-requesting-code-review.sh`	`code-review-catches-planted-bugs.yaml`	candidate — verify before deleting
`tests/claude-code/test-subagent-driven-development-integration.sh`	`sdd-rejects-extra-features.yaml` (YAGNI subset)	partial — bash test also asserts ≥3 commits / `npm test` passes / runs `analyze-token-usage.py`. Drill scenario asserts forbidden-exports + reviewer-as-gate. Mostly disjoint — almost certainly keep + extend drill scenario.
`tests/claude-code/test-subagent-driven-development.sh`	meta/documentation test (asks agent to describe SDD); no drill scenario covers description tests	keep unless drill scenario added
`tests/claude-code/test-worktree-native-preference.sh`	`worktree-creation-under-pressure.yaml`	candidate — verify before deleting
`tests/claude-code/test-helpers.sh`, `run-skill-tests.sh`, `analyze-token-usage.py`	n/a (utilities, not tests)	keep — libraries/tools

Verification protocol (subagent-gated)

Every change in the implementation plan gets cross-checked by an independent subagent before commit.

Change category	Subagent verification
Each bash-test deletion	Dispatch a subagent with: (a) the bash test file content, (b) the candidate drill scenario YAML, (c) the prompt: “List every assertion the bash test makes. List every verify entry in the drill scenario. For each bash assertion, find a matching drill check or report it as unmatched. Output a per-assertion table.” The subagent’s output is the gate: only delete if every bash assertion has a match.
Initial `evals/` copy	Subagent verifies: (a) drill SHA being copied is recorded in the lift commit message so provenance is auditable; (b) per-file SHA-256 checksum matches drill repo for every file (not just file count); (c) excluded paths (`.git/`, `.venv/`, `results/`, `.env`, `__pycache__/`, `*.egg-info/`, any `.private-journal/`) are absent from `evals/`; (d) all backend YAMLs reference paths that exist post-move; (e) `pyproject.toml`, `uv.lock`, `.gitignore` are intact.
Drill’s own pytest suite	Subagent runs `cd evals && uv run pytest` after the path-default change. Drill ships its own pytest suite at `evals/tests/` including `test_backend.py` which exercises `SUPERPOWERS_ROOT` env-var behavior — these tests must update to match the helper and continue to pass.
Reference scrubbing after deletion	Subagent greps the entire superpowers tree (excluding `node_modules/`, `.venv/`, and `evals/`) for references to deleted bash test paths. Search targets: `docs/`, `docs/superpowers/plans/`, `RELEASE-NOTES.md`, `CLAUDE.md`, `GEMINI.md`, `AGENTS.md`, `README.md`, `.github/`, `scripts/`, `.opencode/INSTALL.md`, `.codex-plugin/INSTALL.md`, `lefthook.yml`. Any hit is either updated or surfaces a missed dependency.
Path defaults change (`SUPERPOWERS_ROOT` default)	Subagent runs at least one cheap drill scenario after the path changes (e.g., `triggering-test-driven-development`) and confirms it still passes. Real validation, not just code review.
Final pre-PR adversarial review	Two subagents in parallel, “5 points to whoever finds the most legitimate issues” framing, the same protocol used on the cross-platform PR. Verify both source code and behavior.

Each subagent task gets its own bullet in the implementation plan with explicit inputs and pass criteria. The subagent’s output is summarized in the relevant commit message (“Subagent verification: …”) so the trail is auditable.
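The per-file checksum gate for the initial copy could look roughly like the sketch below. It is an illustration under assumptions: the function names are hypothetical, the exclude matching only approximates rsync's unanchored name patterns, and the demo builds throwaway directories rather than touching a real checkout.

```python
import hashlib
import tempfile
from pathlib import Path

# Names excluded from the copy, per the spec's exclude list.
EXCLUDED_NAMES = {".git", ".venv", "results", ".env", "__pycache__", ".private-journal"}


def is_excluded(rel: Path) -> bool:
    """True if any path component is an excluded name or ends in .egg-info."""
    return any(p in EXCLUDED_NAMES or p.endswith(".egg-info") for p in rel.parts)


def files_under(root: Path) -> list[Path]:
    """All regular files under root, as sorted paths relative to root."""
    return sorted(p.relative_to(root) for p in root.rglob("*") if p.is_file())


def digests(root: Path) -> dict[str, str]:
    """SHA-256 hex digest for every non-excluded file under root."""
    return {
        rel.as_posix(): hashlib.sha256((root / rel).read_bytes()).hexdigest()
        for rel in files_under(root)
        if not is_excluded(rel)
    }


def compare_trees(source: Path, copy: Path) -> dict[str, list[str]]:
    """Report files that are missing, extra, or changed in the copy."""
    src, dst = digests(source), digests(copy)
    return {
        "missing": sorted(src.keys() - dst.keys()),
        "extra": sorted(dst.keys() - src.keys()),
        "changed": sorted(k for k in src.keys() & dst.keys() if src[k] != dst[k]),
    }


def leaked_excluded(copy: Path) -> list[str]:
    """Files in the copy that sit on an excluded path (should be empty)."""
    return [rel.as_posix() for rel in files_under(copy) if is_excluded(rel)]


# Demo on temporary directories standing in for the drill repo and evals/.
with tempfile.TemporaryDirectory() as tmp:
    drill, evals = Path(tmp, "drill"), Path(tmp, "evals")
    for root in (drill, evals):
        (root / "drill").mkdir(parents=True)
        (root / "drill" / "cli.py").write_text("print('hi')\n")
    (drill / ".git").mkdir()
    (drill / ".git" / "HEAD").write_text("ref\n")
    (evals / ".env").write_text("SECRET=1\n")           # simulated leak
    (evals / "drill" / "cli.py").write_text("changed\n")  # simulated drift
    report = compare_trees(drill, evals)
    leaked = leaked_excluded(evals)

print(report)
print(leaked)
```

The gate passes only when all three report lists and the leak list are empty. Empty excluded directories are not detected by this sketch, since it walks files only.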

Concrete path/config edits

Verified prior to writing this spec. drill/cli.py defines PROJECT_ROOT = Path(__file__).parent.parent. After the move, cli.py lives at evals/drill/cli.py, so PROJECT_ROOT resolves to evals/ and PROJECT_ROOT.parent resolves to the superpowers repo root. That’s the value SUPERPOWERS_ROOT should take by default.
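The path arithmetic can be checked in isolation. The sketch below uses `PurePosixPath` so it behaves the same on any platform; the `/work/superpowers` checkout location is a made-up example.

```python
from pathlib import PurePosixPath

# Hypothetical checkout location, for illustration only.
cli_file = PurePosixPath("/work/superpowers/evals/drill/cli.py")

# Mirrors PROJECT_ROOT = Path(__file__).parent.parent in drill/cli.py.
project_root = cli_file.parent.parent
# The proposed SUPERPOWERS_ROOT default is one level above PROJECT_ROOT.
superpowers_root = project_root.parent

print(project_root)      # /work/superpowers/evals
print(superpowers_root)  # /work/superpowers
```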

YAML substitution audit. Only the four claude*.yaml backend configs interpolate ${SUPERPOWERS_ROOT} into args (for the --plugin-dir flag); codex.yaml and gemini.yaml only list SUPERPOWERS_ROOT in required_env (consumed by engine.py:233 / setup.py:25’s os.environ["SUPERPOWERS_ROOT"] lookups in pre/post-run hooks). The helper’s os.environ mutation covers both code paths.

File	Current	After
`drill/cli.py`	`load_dotenv(PROJECT_ROOT / ".env")` at module import; nothing about `SUPERPOWERS_ROOT`	After `load_dotenv`, call new helper `_set_superpowers_root_default()` that sets `os.environ["SUPERPOWERS_ROOT"]` to `str(PROJECT_ROOT.parent)` if and only if not already set. Order: `load_dotenv` → set default → click group definitions.
`drill/engine.py:233`, `drill/setup.py:25`	Direct `os.environ["SUPERPOWERS_ROOT"]` access (KeyError if unset)	Unchanged. The CLI startup hook guarantees the env var is set by the time the engine/setup execute.
`backends/claude*.yaml` (5 files)	`${SUPERPOWERS_ROOT}` substituted in `args` for `--plugin-dir`	Unchanged. YAML substitution reads `os.environ` at backend-load time, which is after CLI startup.
`backends/codex.yaml`, `backends/gemini.yaml`	`SUPERPOWERS_ROOT` in `required_env` only	Drop from `required_env` (the helper supplies it). `claude*.yaml` keep `required_env` for backward compat (env var works as override).
`evals/tests/test_backend.py`	Tests assert `SUPERPOWERS_ROOT` is in `required_env` lists, plus path-resolution tests	Update tests to match the new contract: helper-supplied default, env override still works, `required_env` no longer required for codex/gemini.
`evals/README.md`	“export SUPERPOWERS_ROOT=/path/to/superpowers”	Drop the export line; note that the env var auto-defaults to the parent of `evals/`; mention the only required setup is `ANTHROPIC_API_KEY` (or `OPENAI_API_KEY` / Gemini auth).
`evals/CLAUDE.md`	Same	Same
`evals/.gitignore`	drill’s existing patterns (`results/`, `.venv/`, `__pycache__/`, `.env`, `.pyc`, `.egg-info/`, `dist/`, `build/`, `.claude/`)	Copied verbatim. Patterns are relative to file location, so they apply correctly under `evals/`.
`evals/lefthook.yml`	drill ships `lefthook.yml` defining `pre-commit: uv run ruff check && uv run ty check`	Move to `evals/lefthook.yml`. Either (a) install lefthook at the superpowers root and have it federate to `evals/lefthook.yml`, or (b) document that contributors run `cd evals && lefthook run pre-commit` manually. Decision in implementation: option (b) for simplicity, since superpowers’ top-level workflow doesn’t change.
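A minimal sketch of the default-setting helper described in the `drill/cli.py` row follows. It is not the final implementation: the planned helper is `_set_superpowers_root_default()` reading the module-level `PROJECT_ROOT`, whereas this version takes the root and the environment mapping as parameters (an assumption made here so the behavior can be shown without mutating the real process environment).

```python
import os
from collections.abc import MutableMapping
from pathlib import PurePath, PurePosixPath


def set_superpowers_root_default(
    project_root: PurePath,
    environ: MutableMapping[str, str] = os.environ,
) -> str:
    """Set SUPERPOWERS_ROOT to project_root's parent unless it is already set.

    Returns the value in effect afterwards. An existing value, even an empty
    string, is left untouched, so the env var keeps working as an override.
    """
    return environ.setdefault("SUPERPOWERS_ROOT", str(project_root.parent))


# Hypothetical checkout path; plain dicts stand in for os.environ.
evals_root = PurePosixPath("/work/superpowers/evals")

unset_env: dict[str, str] = {}
defaulted = set_superpowers_root_default(evals_root, unset_env)

override_env = {"SUPERPOWERS_ROOT": "/custom/checkout"}
overridden = set_superpowers_root_default(evals_root, override_env)

print(defaulted)   # /work/superpowers
print(overridden)  # /custom/checkout
```

In the CLI module the call would sit after `load_dotenv`, so a value from `evals/.env` also counts as already set and wins over the default.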

.env placement: keep evals/.env (gitignored). Contributors source it from there or set ANTHROPIC_API_KEY in their shell environment.

Top-level superpowers files needing small additions:

superpowers/.gitignore: add evals/results/, evals/.venv/, evals/.env (belt-and-suspenders; evals/.gitignore already covers these locally).
superpowers/CLAUDE.md: add a one-line pointer “Eval harness lives at evals/ — see evals/README.md” so agents discover it.
superpowers/docs/testing.md: split into “## Plugin tests” (existing tests/ content, with the deleted-test references trimmed) and “## Skill behavior evals” (one-paragraph summary + pointer to evals/).
superpowers/README.md: add a single line in the Contributing section pointing at evals/ for skill-behavior testing.

Migration ordering

Each step is a separate commit (or small group of commits). Step 2 is the biggest single commit (the verbatim drill copy); subsequent steps are small and atomic.

1. Branch off `dev` (f/evals-lift)

2. Copy drill repo into evals/ (single commit, easy to revert)
   ├─ Record drill SHA at copy time → commit message
   ├─ Use `rsync -a --exclude=.git --exclude=.venv --exclude=results
   │  --exclude=.env --exclude=__pycache__ --exclude='*.egg-info'
   │  --exclude=.private-journal /path/to/drill/ evals/`
   │  (rsync chosen over `cp -r` for explicit excludes; verify with
   │  `find evals -name '.git' -type d` returns nothing)
   ├─ Subagent gate: per-file SHA-256 checksum matches drill repo for every
   │  non-excluded file; excluded paths absent from evals/
   └─ Smoke check: `cd evals && uv sync` succeeds (proves install only;
      not a behavioral test)

3. Update path defaults
   ├─ Add _set_superpowers_root_default() helper to drill/cli.py
   ├─ Wire it after load_dotenv, before click group definition
   ├─ Update evals/README.md and evals/CLAUDE.md (drop SUPERPOWERS_ROOT install step)
   ├─ Drop SUPERPOWERS_ROOT from required_env in codex.yaml/gemini.yaml
   │  (keep in claude*.yaml as override)
   └─ Update evals/tests/test_backend.py to match new contract

4. Validate from new location (TWO checks)
   ├─ Run drill's own pytest: `cd evals && uv run pytest` — must pass
   └─ Run cheap drill scenario: `cd evals && uv run drill run
      triggering-test-driven-development -b claude` — must pass.
      Real behavioral validation, not just code review.

5. Bash test deletion phase — per-file with subagent gate
   For each file in the candidate-deletion list:
   a. Subagent compares bash test assertions vs drill scenario verify block
   b. Pass criterion: every bash assertion has a matching drill check
   c. If pass → delete the bash test file (one commit per file or per
      coherent group)
   d. If fail → either extend drill scenario (separate commit + verify) or
      keep the bash test (no commit)

6. Stale-reference scrub
   ├─ Subagent greps the superpowers tree (excluding node_modules/, .venv/,
   │  evals/) for deleted file paths
   ├─ Search targets: docs/, docs/superpowers/plans/, RELEASE-NOTES.md,
   │  CLAUDE.md, GEMINI.md, AGENTS.md, README.md, .github/, scripts/,
   │  .opencode/INSTALL.md, .codex-plugin/INSTALL.md, lefthook.yml
   ├─ Update active references (e.g., docs/testing.md, README.md install)
   └─ Historical references in docs/superpowers/plans/*.md and
      RELEASE-NOTES.md are PRESERVED with a brief annotation
      ("(test removed; behavior covered by drill scenario X)") rather
      than rewritten — these are dated artifacts, not living docs.

7. Top-level docs
   ├─ docs/testing.md split
   ├─ CLAUDE.md pointer
   └─ README.md Contributing section

8. Re-run smoke checks (regression gate)
   ├─ `cd evals && uv run pytest`
   └─ `cd evals && uv run drill run triggering-test-driven-development -b claude`

9. Final adversarial review
   └─ Two parallel subagents, full diff, "5 points to whoever finds the
      most legitimate issues" framing. Address findings before push.

10. Push branch + open PR against dev
    └─ PR description includes: drill SHA pinned at copy, archival action
       item ("after merge: archive obra/drill, add README pointer to
       obra/superpowers/evals/"), per-deleted-file coverage receipts.
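The stale-reference scrub in step 6 is essentially a filtered text search. The sketch below shows one way to express it; `find_stale_references` is a hypothetical name, skipping `.git/` is an added assumption beyond the spec's exclude list, and the demo runs against a temporary directory rather than a real checkout.

```python
import tempfile
from pathlib import Path

# node_modules/ and .venv/ come from the spec; .git/ is an added assumption.
SKIP_ANYWHERE = {"node_modules", ".venv", ".git"}


def find_stale_references(
    repo_root: Path, deleted_paths: list[str]
) -> list[tuple[str, int, str]]:
    """Return (file, line number, deleted path) for every remaining mention."""
    hits = []
    for path in sorted(repo_root.rglob("*")):
        rel = path.relative_to(repo_root)
        if not path.is_file():
            continue
        # Skip the top-level evals/ tree and dependency/VCS directories.
        if rel.parts[0] == "evals" or any(p in SKIP_ANYWHERE for p in rel.parts):
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # binary or unreadable file
        for lineno, line in enumerate(text.splitlines(), start=1):
            for deleted in deleted_paths:
                if deleted in line:
                    hits.append((rel.as_posix(), lineno, deleted))
    return hits


# Demo: one living doc still mentions a deleted test; evals/ is ignored.
with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp)
    (repo / "docs").mkdir()
    (repo / "evals").mkdir()
    deleted = "tests/claude-code/test-requesting-code-review.sh"
    (repo / "docs" / "testing.md").write_text(
        f"# Testing\nRun {deleted}\n", encoding="utf-8"
    )
    (repo / "evals" / "README.md").write_text(f"See {deleted}\n", encoding="utf-8")
    hits = find_stale_references(repo, [deleted])

print(hits)
```

Each hit then gets triaged by hand: living docs are updated, while dated artifacts such as plans and release notes are annotated rather than rewritten.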

Verification (post-implementation)

The implementation plan must show:

All non-excluded drill source files present at evals/ after step 2 (subagent per-file SHA-256 checksum diff vs obra/drill@<recorded-sha>).
Excluded paths (.git/, .venv/, results/, .env, __pycache__/, *.egg-info/, .private-journal/) absent from evals/.
The step-2 commit message records the drill source SHA.
cd evals && uv sync succeeds without SUPERPOWERS_ROOT set.
cd evals && uv run pytest passes (drill’s own pytest suite).
cd evals && uv run drill list returns the same scenario count as the standalone drill repo at the recorded SHA.
cd evals && uv run drill run triggering-test-driven-development -b claude passes (proves path defaults work end-to-end).
For each deleted bash test: subagent verification table in the commit message showing every assertion mapped to a drill check.
Grep for deleted file paths returns zero hits across living superpowers docs (post step 6); historical refs in docs/superpowers/plans/*.md and RELEASE-NOTES.md are annotated, not rewritten.
docs/testing.md has both “Plugin tests” and “Skill behavior evals” sections.
The drill repo’s history is untouched; obra/drill is unaffected by this PR.
PR description names the action item to archive obra/drill after merge.

Open questions

None. All clarifying decisions have been made:

Question	Decision
Where does drill live in superpowers?	`evals/` (rename from drill); standalone repo archived as separate step
Fate of redundant bash tests?	Delete per-file with subagent verification of coverage; default keep
Scenarios layout?	Centralized at `evals/scenarios/`
Python toolchain placement?	Self-contained at `evals/`
CI integration?	Manual-only this PR; documented future path
Migration mechanics?	Plain copy; drill repo’s history preserved in archived repo, not in-tree
Internal Python package name?	Keep as `drill` (directory is `evals/`)
Branching strategy?	Independent off `dev` (not stacked on `f/cross-platform`)
