Lift drill into Superpowers' evals/: Implementation Plan
For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
Goal: Move the standalone obra/drill skill-compliance benchmark into superpowers as a top-level evals/ directory, delete redundant bash tests under superpowers/tests/ after per-file subagent verification of drill scenario coverage, and update top-level docs so contributors land on the new structure.
Architecture: Single PR against dev on a new branch f/evals-lift. Drill source is copied verbatim with explicit rsync excludes to keep .git/, .venv/, etc. out of the new directory. A small helper in drill/cli.py defaults SUPERPOWERS_ROOT to the parent of the evals/ directory, so contributors don't have to set the env var. Each bash-test deletion is gated by a subagent that compares the bash test's assertions to the verify block of its claimed drill scenario. Historical references in plan docs and release notes are annotated, not rewritten.
Tech Stack: Python 3.11 + uv (drill's existing toolchain, unchanged); rsync; bash; git.
Spec: docs/superpowers/specs/2026-05-06-lift-drill-into-evals-design.md — read this first.
Drill source location: /Users/jesse/Documents/GitHub/superpowers/drill/ (sibling to superpowers/).
Task 1: Branch off dev
Files: none (git operation only)

- Step 1: Verify clean working tree

    cd /Users/jesse/Documents/GitHub/superpowers/superpowers
    git status --short

Expected: empty output (or only untracked .opencode/package-lock.json, which is fine).

- Step 2: Fetch latest dev

    git fetch origin dev:dev

- Step 3: Create the branch

    git checkout -b f/evals-lift dev

Expected: Switched to a new branch 'f/evals-lift'.

- Step 4: Sanity check

    git log --oneline -1

Expected: output begins with whatever commit origin/dev points to (currently b4363df docs: turned the dash in "- Jesse" into an escape sequence (#1474)).
Task 2: Capture drill SHA at copy time
Files: none (records the value for the lift commit message)

- Step 1: Get the current drill HEAD SHA

    cd /Users/jesse/Documents/GitHub/superpowers/drill
    DRILL_SHA=$(git rev-parse HEAD)
    echo "$DRILL_SHA"

- Step 2: Verify drill has no uncommitted work

    cd /Users/jesse/Documents/GitHub/superpowers/drill
    git status --short

Expected: empty (no untracked or modified files). If the output is non-empty, stop and report: the drill working tree must be clean before the lift, otherwise the SHA pin is meaningless.

- Step 3: Save the SHA in the shell env for the next task

    echo "DRILL_SHA=$DRILL_SHA"  # write this down for use in Task 3

Task 3: rsync drill into evals/
Files:
- Create: evals/ (entire directory tree from drill, minus excludes)

- Step 1: Verify source and destination paths

    cd /Users/jesse/Documents/GitHub/superpowers/superpowers
    test -d /Users/jesse/Documents/GitHub/superpowers/drill && echo "drill source: OK"
    test ! -d evals && echo "evals/ does not yet exist: OK"

Expected: both echoes print.

- Step 2: rsync drill to evals/ with explicit excludes

    cd /Users/jesse/Documents/GitHub/superpowers/superpowers
    rsync -a \
      --exclude=.git \
      --exclude=.venv \
      --exclude=results \
      --exclude=.env \
      --exclude=__pycache__ \
      --exclude='*.egg-info' \
      --exclude=.private-journal \
      --exclude='*.pyc' \
      /Users/jesse/Documents/GitHub/superpowers/drill/ \
      evals/

- Step 3: Verify excludes worked

    find evals -name '.git' -type d
    find evals -name '.venv' -type d
    find evals -name 'results' -type d
    find evals -name '.env'
    find evals -name '__pycache__' -type d
    find evals -name '*.egg-info' -type d

Expected: every command returns no output. If any returns a path, manually rm -rf it before continuing.

- Step 4: Confirm the source SHA for the commit message

    cd /Users/jesse/Documents/GitHub/superpowers/drill
    DRILL_SHA=$(git rev-parse HEAD)
    echo "$DRILL_SHA"

Expected: the SHA from Task 2 step 1.

- Step 5: Stage everything

Step 4 left the shell in the drill checkout, so change back to the superpowers repo before staging.

    cd /Users/jesse/Documents/GitHub/superpowers/superpowers
    git add evals/
    git status --short | head -20

Expected: output starts with A evals/... lines listing many added files. Many of these are in scenarios/, drill/, backends/, setup_helpers/, etc.

- Step 6: Commit

    : "${DRILL_SHA:?Set DRILL_SHA from Task 2 before committing}"
    git commit -m "$(cat <<EOF
    Lift drill into evals/ at $DRILL_SHA

    rsync of obra/drill@$DRILL_SHA into superpowers/evals/, excluding
    .git/, .venv/, results/, .env/, __pycache__/, *.egg-info/,
    .private-journal/.

    The drill repo is unaffected by this commit; archival is a separate
    manual step after this PR merges.

    Source SHA recorded in this commit message for provenance.
    EOF
    )"

Task 4: Verify the copy with checksums
Files: none (verification only)

- Step 1: List the drill files that should be in evals/ (everything except the excludes)

    cd /Users/jesse/Documents/GitHub/superpowers/drill
    find . \
      \( -name '.git' -prune \
         -o -name '.venv' -prune \
         -o -name 'results' -prune \
         -o -name '__pycache__' -prune \
         -o -name '*.egg-info' -prune \
         -o -name '.private-journal' -prune \
         -o -name '*.pyc' -prune \
         -o -name '.env' -prune \) \
      -o -type f -print | sort > /tmp/drill-files.txt
    wc -l /tmp/drill-files.txt

- Step 2: Get the list of files in evals/

    cd /Users/jesse/Documents/GitHub/superpowers/superpowers
    find evals -type f | sed 's|^evals/|./|' | sort > /tmp/evals-files.txt
    wc -l /tmp/evals-files.txt

- Step 3: Diff the two lists

The file lists should match exactly after excluded paths are removed.

    diff /tmp/drill-files.txt /tmp/evals-files.txt

Expected: no output.

- Step 4: Per-file checksum verification

    cd /Users/jesse/Documents/GitHub/superpowers/drill
    while read -r f; do
      sha1=$(shasum -a 256 "$f" | cut -d' ' -f1)
      sha2=$(shasum -a 256 "/Users/jesse/Documents/GitHub/superpowers/superpowers/evals/${f#./}" | cut -d' ' -f1)
      if [ "$sha1" != "$sha2" ]; then
        echo "MISMATCH: $f ($sha1 vs $sha2)"
      fi
    done < /tmp/drill-files.txt | head -20

Expected: no output (every file's checksum matches between drill and evals).
- Step 5: Smoke check: install dependencies

    cd /Users/jesse/Documents/GitHub/superpowers/superpowers/evals
    uv sync

Expected: Installed N packages or similar. No errors.

- Step 6: Smoke check: drill list

    cd /Users/jesse/Documents/GitHub/superpowers/superpowers/evals
    uv run drill list 2>&1 | head -5

Expected: starts with scenario names. (Will likely error or warn about missing SUPERPOWERS_ROOT; that's fine, it is fixed in the next task.)

- Step 7: Dispatch verification subagent

Dispatch a general-purpose subagent with this prompt:
    You are verifying a verbatim copy of the drill repo at
    /Users/jesse/Documents/GitHub/superpowers/drill into
    /Users/jesse/Documents/GitHub/superpowers/superpowers/evals.

    Verify:

    1. The lift commit message records the SHA reported by:
       cd /Users/jesse/Documents/GitHub/superpowers/drill && git rev-parse HEAD

    2. None of these excluded paths exist under evals/: .git/, .venv/,
       results/, .env/, __pycache__/, *.egg-info/, .private-journal/.

    3. Every non-excluded file in drill has a SHA-256-identical
       counterpart in evals/, and there are no extra files in evals/.

    4. The pyproject.toml, uv.lock, scenarios/*.yaml, backends/*.yaml,
       setup_helpers/*.py, drill/*.py, prompts/*.md, fixtures/, bin/, and
       docs/ are all present.

    Report each check with PASS/FAIL. If any FAIL, dump enough detail
    that the parent can fix.

If the subagent reports any FAIL, fix the underlying issue (delete the leaked file, re-rsync, etc.) before continuing.
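The file-list diff and per-file checksum loop in this task can also be expressed as a single script. The following is a minimal sketch, not part of drill: `is_excluded`, `tree_digests`, `compare_trees`, and the `EXCLUDES` tuple are hypothetical names, and the patterns are copied from the rsync command in Task 3. Excludes are applied to the source only, so a leaked file in the copy is reported as extra.

```python
import fnmatch
import hashlib
from pathlib import Path

# Hypothetical constant mirroring the rsync --exclude flags from Task 3.
EXCLUDES = (".git", ".venv", "results", ".env", "__pycache__",
            "*.egg-info", ".private-journal", "*.pyc")


def is_excluded(rel: Path, patterns: tuple[str, ...]) -> bool:
    """True if any component of the relative path matches an exclude pattern."""
    return any(fnmatch.fnmatch(part, pat) for part in rel.parts for pat in patterns)


def tree_digests(root: Path, patterns: tuple[str, ...] = ()) -> dict[str, str]:
    """Map each non-excluded file's relative path to its SHA-256 hex digest."""
    digests = {}
    for path in root.rglob("*"):
        rel = path.relative_to(root)
        if path.is_file() and not is_excluded(rel, patterns):
            digests[rel.as_posix()] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def compare_trees(source: Path, copy: Path) -> dict[str, list[str]]:
    """Report files missing from the copy, extra in the copy, or differing."""
    src = tree_digests(source, EXCLUDES)
    dst = tree_digests(copy)  # no excludes here: leaked files show up as "extra"
    return {
        "missing": sorted(src.keys() - dst.keys()),
        "extra": sorted(dst.keys() - src.keys()),
        "mismatch": sorted(k for k in src.keys() & dst.keys() if src[k] != dst[k]),
    }
```

Calling `compare_trees` with the drill checkout and the new evals/ directory should return three empty lists for a verbatim copy. Note that `uv sync` creates evals/.venv, so a run after the smoke checks would report those files as extra.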
Task 5: Add SUPERPOWERS_ROOT default helper
Files:
- Modify: evals/drill/cli.py:11-14

- Step 1: Read the current cli.py header

    sed -n '1,20p' /Users/jesse/Documents/GitHub/superpowers/superpowers/evals/drill/cli.py

Expected output:

    """Drill CLI: run, compare, list."""

    from __future__ import annotations

    import secrets
    from pathlib import Path

    import click
    from dotenv import load_dotenv

    PROJECT_ROOT: Path = Path(__file__).parent.parent

    load_dotenv(PROJECT_ROOT / ".env")

- Step 2: Write a failing test for the helper
Open evals/tests/test_cli.py and add these tests at the end:

    def test_set_superpowers_root_default_when_unset(monkeypatch, tmp_path):
        """When SUPERPOWERS_ROOT is unset, helper sets it to PROJECT_ROOT.parent."""
        import os

        monkeypatch.delenv("SUPERPOWERS_ROOT", raising=False)
        from drill.cli import _set_superpowers_root_default, PROJECT_ROOT

        _set_superpowers_root_default()

        assert os.environ["SUPERPOWERS_ROOT"] == str(PROJECT_ROOT.parent)


    def test_set_superpowers_root_default_respects_existing(monkeypatch):
        """When SUPERPOWERS_ROOT is already set, helper does not override."""
        import os

        monkeypatch.setenv("SUPERPOWERS_ROOT", "/custom/path")
        from drill.cli import _set_superpowers_root_default

        _set_superpowers_root_default()

        assert os.environ["SUPERPOWERS_ROOT"] == "/custom/path"

- Step 3: Run the test and watch it fail

    cd /Users/jesse/Documents/GitHub/superpowers/superpowers/evals
    uv run pytest tests/test_cli.py -k set_superpowers_root_default -v

Expected: 2 tests fail with ImportError: cannot import name '_set_superpowers_root_default' from 'drill.cli' (the tests use a from-import, so the missing name surfaces as an ImportError rather than an AttributeError).
- Step 4: Add the helper to cli.py

Edit /Users/jesse/Documents/GitHub/superpowers/superpowers/evals/drill/cli.py. Replace lines 1-14 with:

    """Drill CLI: run, compare, list."""

    from __future__ import annotations

    import os
    import secrets
    from pathlib import Path

    import click
    from dotenv import load_dotenv

    PROJECT_ROOT: Path = Path(__file__).parent.parent

    load_dotenv(PROJECT_ROOT / ".env")


    def _set_superpowers_root_default() -> None:
        """Default SUPERPOWERS_ROOT to the parent of evals/ if not already set.

        Drill historically required contributors to export SUPERPOWERS_ROOT
        pointing at the superpowers checkout. After lifting drill into
        superpowers/evals/, the parent of PROJECT_ROOT is always the
        superpowers root, so we can supply this default automatically.

        Existing SUPERPOWERS_ROOT environment values are respected as overrides.
        """
        os.environ.setdefault("SUPERPOWERS_ROOT", str(PROJECT_ROOT.parent))


    _set_superpowers_root_default()

The module-level call to _set_superpowers_root_default() runs at import time, immediately after load_dotenv(), so a value set in .env still takes precedence over the default. This ensures that engine.py and setup.py (which read os.environ["SUPERPOWERS_ROOT"] directly) and the YAML interpolation (which reads os.environ when the backend YAML is loaded) all see the value.
- Step 5: Run the test and watch it pass

    cd /Users/jesse/Documents/GitHub/superpowers/superpowers/evals
    uv run pytest tests/test_cli.py -k set_superpowers_root_default -v

Expected: 2 tests pass.

- Step 6: Commit

    cd /Users/jesse/Documents/GitHub/superpowers/superpowers
    git add evals/drill/cli.py evals/tests/test_cli.py
    git commit -m "evals: default SUPERPOWERS_ROOT to parent of evals/ if unset

    Adds _set_superpowers_root_default() to drill/cli.py, called at
    module import after load_dotenv(). PROJECT_ROOT resolves to evals/
    post-lift; its parent is the superpowers repo root, which is the
    correct value for SUPERPOWERS_ROOT.

    Existing env values are respected as overrides via os.environ.setdefault.

    Tests:
    - helper sets default when var is unset
    - helper does not override when var is already set"

Task 6: Update backend YAMLs to reflect the new env contract
Files:
- Modify: evals/backends/codex.yaml (drop SUPERPOWERS_ROOT from required_env)
- Modify: evals/backends/gemini.yaml (drop SUPERPOWERS_ROOT from required_env)

The five claude*.yaml backend configs interpolate ${SUPERPOWERS_ROOT} into args for the --plugin-dir flag, so they keep SUPERPOWERS_ROOT in required_env: the interpolation needs it. The codex/gemini configs only listed it for the os.environ reads in engine.py and setup.py, which the helper now satisfies.
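To make the distinction concrete: a backend whose args contain `${SUPERPOWERS_ROOT}` needs the variable to exist when the placeholder is expanded, whereas a plain `os.environ` read is satisfied by any default. The following is a minimal sketch that assumes the interpolation behaves like `string.Template` substitution; drill's actual loader may work differently, and `interpolate_args` and the `/repo/superpowers` path are hypothetical.

```python
import os
from string import Template


def interpolate_args(args: list[str], env: dict[str, str]) -> list[str]:
    """Expand ${VAR} placeholders in each argument.

    Template.substitute raises KeyError when a referenced variable is missing,
    which is the failure a required_env check is meant to catch early.
    """
    return [Template(arg).substitute(env) for arg in args]


# Simulate the environment after the cli.py helper has supplied a default.
env = dict(os.environ)
env.setdefault("SUPERPOWERS_ROOT", "/repo/superpowers")  # hypothetical default path

claude_args = ["--plugin-dir", "${SUPERPOWERS_ROOT}"]
print(interpolate_args(claude_args, env))
```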
- Step 1: Confirm current state

    grep -A3 'required_env:' /Users/jesse/Documents/GitHub/superpowers/superpowers/evals/backends/codex.yaml
    grep -A2 'required_env:' /Users/jesse/Documents/GitHub/superpowers/superpowers/evals/backends/gemini.yaml

Expected: outputs include - SUPERPOWERS_ROOT lines.

- Step 2: Read codex.yaml fully

    cat /Users/jesse/Documents/GitHub/superpowers/superpowers/evals/backends/codex.yaml

- Step 3: Edit codex.yaml: drop the - SUPERPOWERS_ROOT line under required_env

Open evals/backends/codex.yaml and find:

    required_env:
      - OPENAI_API_KEY
      - SUPERPOWERS_ROOT

Replace with:

    required_env:
      - OPENAI_API_KEY

- Step 4: Edit gemini.yaml: drop the - SUPERPOWERS_ROOT line under required_env

Open evals/backends/gemini.yaml and find:

    required_env:
      - SUPERPOWERS_ROOT

Replace with:

    required_env: []

(Use an empty list rather than dropping the field, so YAML schema validation doesn't trip.)
- Step 5: Run drill's pytest suite to ensure nothing broke

    cd /Users/jesse/Documents/GitHub/superpowers/superpowers/evals
    uv run pytest -x 2>&1 | tail -20

Expected: all tests pass. If tests/test_backend.py complains about required_env membership for codex/gemini, see Task 7.

- Step 6: Commit

    cd /Users/jesse/Documents/GitHub/superpowers/superpowers
    git add evals/backends/codex.yaml evals/backends/gemini.yaml
    git commit -m "evals: drop SUPERPOWERS_ROOT from codex/gemini required_env

    These backends only read SUPERPOWERS_ROOT via engine.py/setup.py's
    os.environ access, which the new cli.py default helper supplies
    automatically. claude*.yaml keep SUPERPOWERS_ROOT in required_env
    because they interpolate \${SUPERPOWERS_ROOT} into --plugin-dir args."

Task 7: Update drill's pytest suite for the new contract
Files:
- Modify: evals/tests/test_backend.py (per-test updates if Task 6 step 5 surfaced failures)

- Step 1: Run the test suite

    cd /Users/jesse/Documents/GitHub/superpowers/superpowers/evals
    uv run pytest tests/test_backend.py -v 2>&1 | tail -30

If all tests pass, skip to step 5 (commit nothing, move to Task 8). Otherwise:

- Step 2: Read failing tests

For each failure, open the test in evals/tests/test_backend.py and read the assertion.

- Step 3: Update assertions

For tests that assert SUPERPOWERS_ROOT membership in the required_env of codex.yaml or gemini.yaml, invert the assertion to confirm absence. Example:

    # Before:
    def test_codex_requires_superpowers_root():
        backend = load_backend("codex")
        assert "SUPERPOWERS_ROOT" in backend.required_env

    # After:
    def test_codex_does_not_require_superpowers_root():
        """codex.yaml dropped SUPERPOWERS_ROOT from required_env; the cli.py helper supplies the default."""
        backend = load_backend("codex")
        assert "SUPERPOWERS_ROOT" not in backend.required_env

- Step 4: Re-run the test suite

    cd /Users/jesse/Documents/GitHub/superpowers/superpowers/evals
    uv run pytest -x 2>&1 | tail -10

Expected: all tests pass.

- Step 5: Commit (only if step 1 had failures)

    cd /Users/jesse/Documents/GitHub/superpowers/superpowers
    git add evals/tests/test_backend.py
    git commit -m "evals: update test_backend.py for relaxed required_env contract"

Task 8: Update evals/README.md and evals/CLAUDE.md
Files:
- Modify: evals/README.md (drop SUPERPOWERS_ROOT setup step)
- Modify: evals/CLAUDE.md (drop SUPERPOWERS_ROOT setup step)

- Step 1: Edit evals/README.md

Find the section that looks like:

    Required environment:
    ```bash
    export SUPERPOWERS_ROOT=/path/to/superpowers
    export ANTHROPIC_API_KEY=sk-...
    ```

Replace with:

    Required environment:
    ```bash
    export ANTHROPIC_API_KEY=sk-...
    ```

    `SUPERPOWERS_ROOT` defaults to the parent of `evals/` (the superpowers repo root) and only needs to be set if you're running drill against a different superpowers checkout.

- Step 2: Edit evals/CLAUDE.md

Find the section:

    ## Required env

    SUPERPOWERS_ROOT=/path/to/superpowers
    ANTHROPIC_API_KEY=sk-...

Replace with:

    ## Required env

    ANTHROPIC_API_KEY=sk-...

    `SUPERPOWERS_ROOT` defaults to the parent of `evals/` (the superpowers repo root). Override only if running drill against a different superpowers checkout.

- Step 3: Commit

    cd /Users/jesse/Documents/GitHub/superpowers/superpowers
    git add evals/README.md evals/CLAUDE.md
    git commit -m "evals: drop SUPERPOWERS_ROOT setup step from README/CLAUDE

    The cli.py helper now defaults the env var. Mention as override only."

Task 9: Validate from new location
Files: none (validation only; no commit unless something needs fixing)

- Step 1: Run drill's full pytest suite

    cd /Users/jesse/Documents/GitHub/superpowers/superpowers/evals
    unset SUPERPOWERS_ROOT
    uv run pytest 2>&1 | tail -5

Expected: all tests pass. The unset ensures we're testing the helper, not an inherited env var.

- Step 2: Run drill list

    cd /Users/jesse/Documents/GitHub/superpowers/superpowers/evals
    unset SUPERPOWERS_ROOT
    uv run drill list 2>&1 | head -10

Expected: scenario list, no error about missing SUPERPOWERS_ROOT.

- Step 3: Source the env file

    set -a
    source /Users/jesse/Documents/GitHub/prime-radiant-inc/sprout/.env
    set +a
    echo "ANTHROPIC_API_KEY set: ${ANTHROPIC_API_KEY:+yes}"

Expected: ANTHROPIC_API_KEY set: yes.

- Step 4: Run a cheap drill scenario

    cd /Users/jesse/Documents/GitHub/superpowers/superpowers/evals
    unset SUPERPOWERS_ROOT
    uv run drill run triggering-test-driven-development -b claude 2>&1 | tail -3

Expected: claude: 1 passed, 0 failed, 0 errors.

If the scenario fails, debug before continuing. The path-defaults change is the most likely culprit; check that the helper actually fired by temporarily adding a print(os.environ["SUPERPOWERS_ROOT"]) after the helper call.
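If the helper is the suspect, its contract can be checked in isolation before touching cli.py. The following is a minimal sketch that re-implements the one-line `setdefault` logic against a plain dict instead of importing drill; `set_root_default` and both paths are hypothetical.

```python
from collections.abc import MutableMapping
from pathlib import Path


def set_root_default(env: MutableMapping[str, str], project_root: Path) -> str:
    """Mirror the helper's logic: use project_root's parent unless a value exists."""
    return env.setdefault("SUPERPOWERS_ROOT", str(project_root.parent))


project_root = Path("/repo/superpowers/evals")  # hypothetical checkout location

# Case 1: nothing set, so the parent of evals/ becomes the default.
unset_env: dict[str, str] = {}
print(set_root_default(unset_env, project_root))

# Case 2: an existing value is kept as an override.
custom_env = {"SUPERPOWERS_ROOT": "/custom/path"}
print(set_root_default(custom_env, project_root))
```

If the real run disagrees with this behavior, the likely causes are an inherited SUPERPOWERS_ROOT in the shell or a value in evals/.env, since both are read before the default is applied.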
Task 10: Bash test deletion phase (per-file, with subagent gate)
This task has many sub-steps because each candidate-deletion file gets its own subagent verification and commit. The candidate list comes from the spec's coverage map. For each entry below:
- Read the bash test file.
- Read the candidate drill scenario YAML.
- Dispatch a subagent with both contents and the comparison prompt.
- The subagent reports a per-assertion match table.
- If every bash assertion has a match: delete the bash test and commit.
- If any assertion is unmatched: stop, escalate, and do not delete.
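The last two steps hinge on the subagent's final verdict line. The following is a minimal sketch of a parser for that line, assuming the report follows the `VERDICT:` format requested in this task's prompt template; `parse_verdict` is a hypothetical helper, and anything missing or unrecognized is treated as KEEP to stay conservative.

```python
def parse_verdict(report: str) -> bool:
    """Return True only if the last VERDICT line is exactly 'SAFE TO DELETE'."""
    verdicts = [
        line.strip()
        for line in report.splitlines()
        if line.strip().startswith("VERDICT:")
    ]
    # No verdict line, or any KEEP variant, means the bash test stays.
    return bool(verdicts) and verdicts[-1] == "VERDICT: SAFE TO DELETE"
```

Taking the last verdict line rather than the first guards against a report that quotes the template before giving its own answer.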
Subagent prompt template (use for every deletion):

    You are gating a bash test deletion. The bash test is allegedly
    covered by a drill scenario; your job is to verify that claim.

    BASH TEST: <paste full contents of bash test>

    DRILL SCENARIO: <paste full contents of drill scenario YAML>

    Output a markdown table with columns: BASH ASSERTION, DRILL CHECK,
    STATUS. List EVERY assertion the bash test makes (every grep, every
    [ ], every test command, every PASS/FAIL emit). For each, find a
    matching drill check (in verify.assertions or verify.criteria) or
    mark as UNMATCHED.

    After the table, output "VERDICT: SAFE TO DELETE" if every bash
    assertion has a match, otherwise "VERDICT: KEEP - N unmatched
    assertions". Be conservative: if you are uncertain about a match,
    mark as UNMATCHED.

Task 10a: Skill-triggering prompts (6 files)
Files:
- Delete: tests/skill-triggering/prompts/dispatching-parallel-agents.txt
- Delete: tests/skill-triggering/prompts/executing-plans.txt
- Delete: tests/skill-triggering/prompts/requesting-code-review.txt
- Delete: tests/skill-triggering/prompts/systematic-debugging.txt
- Delete: tests/skill-triggering/prompts/test-driven-development.txt
- Delete: tests/skill-triggering/prompts/writing-plans.txt
- Keep: tests/skill-triggering/run-test.sh, run-all.sh

These prompt files are inputs to the bash runner; they don't have their own assertions. The runner script does the asserting. Map each prompt to its drill scenario:

| Prompt | Drill scenario |
|---|---|
| dispatching-parallel-agents.txt | triggering-dispatching-parallel-agents.yaml |
| executing-plans.txt | triggering-executing-plans.yaml |
| requesting-code-review.txt | triggering-requesting-code-review.yaml |
| systematic-debugging.txt | triggering-systematic-debugging.yaml |
| test-driven-development.txt | triggering-test-driven-development.yaml |
| writing-plans.txt | triggering-writing-plans.yaml |
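The mapping in the table is mechanical: each prompt's stem is prefixed with `triggering-` and given a `.yaml` extension. The following is a minimal sketch of that rule, useful for confirming that every prompt has a scenario file before dispatching subagents; `scenario_for_prompt` and `missing_scenarios` are hypothetical helpers, and the default directory is taken from the paths used in this task.

```python
from pathlib import Path


def scenario_for_prompt(prompt: Path, scenarios_dir: Path = Path("evals/scenarios")) -> Path:
    """Apply the table's naming rule: <name>.txt -> triggering-<name>.yaml."""
    return scenarios_dir / f"triggering-{prompt.stem}.yaml"


def missing_scenarios(prompts_dir: Path, scenarios_dir: Path) -> list[str]:
    """List prompt file names whose mapped scenario file does not exist."""
    return sorted(
        prompt.name
        for prompt in prompts_dir.glob("*.txt")
        if not scenario_for_prompt(prompt, scenarios_dir).is_file()
    )
```

Run from the superpowers repo root, `missing_scenarios(Path("tests/skill-triggering/prompts"), Path("evals/scenarios"))` should return an empty list if the table is accurate.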
- Step 1: For each prompt file, dispatch the subagent

For prompt tests/skill-triggering/prompts/<name>.txt and scenario evals/scenarios/triggering-<name>.yaml, run the subagent prompt template with both contents pasted in. The subagent's job is to verify that the prompt content matches what the drill scenario's turns[].intent describes.

If all 6 verify SAFE TO DELETE, proceed to step 2. If any verifies KEEP, that one stays and the rest may still proceed.

- Step 2: Verify the runner is still useful for unrelated cases

    ls /Users/jesse/Documents/GitHub/superpowers/superpowers/tests/skill-triggering/prompts/

If the prompts/ directory will be empty after the planned deletions, also delete tests/skill-triggering/run-test.sh and run-all.sh (they have nothing to run). Otherwise keep the runner.

- Step 3: Delete and commit

    cd /Users/jesse/Documents/GitHub/superpowers/superpowers
    git rm tests/skill-triggering/prompts/dispatching-parallel-agents.txt
    git rm tests/skill-triggering/prompts/executing-plans.txt
    git rm tests/skill-triggering/prompts/requesting-code-review.txt
    git rm tests/skill-triggering/prompts/systematic-debugging.txt
    git rm tests/skill-triggering/prompts/test-driven-development.txt
    git rm tests/skill-triggering/prompts/writing-plans.txt
    # If runner is now orphaned:
    git rm tests/skill-triggering/run-test.sh tests/skill-triggering/run-all.sh
    rmdir tests/skill-triggering/prompts/ 2>/dev/null || true
    rmdir tests/skill-triggering/ 2>/dev/null || true
    git commit -m "tests: remove skill-triggering bash prompts (covered by drill triggering-* scenarios)

    Subagent verification confirmed each prompt's intent matches its
    corresponding drill scenario's turns[].intent. Drill scenarios are
    canonical; bash runner has no remaining prompts to drive."

Task 10b: explicit-skill-requests (selective deletion)
Files:
- Inspect: 6 files in tests/explicit-skill-requests/
- Delete: only those verified to be 100% covered by drill scenarios
- Keep: the rest

Per the spec's updated coverage map, most of these have no drill counterpart. The likely-deletable ones:

| Bash test | Candidate drill scenario | Likely outcome |
|---|---|---|
| run-test.sh | n/a (runner) | KEEP |
| run-all.sh | n/a (runner) | KEEP |
| run-claude-describes-sdd.sh | mid-conversation-skill-invocation.yaml | likely DELETE; verify |
| run-haiku-test.sh | none (Haiku-specific) | KEEP |
| run-multiturn-test.sh, run-extended-multiturn-test.sh | none | KEEP |
| prompts/please-use-brainstorming.txt, prompts/use-systematic-debugging.txt | none | KEEP |
- Step 1: Read each .sh file and prompt to confirm

    for f in /Users/jesse/Documents/GitHub/superpowers/superpowers/tests/explicit-skill-requests/*.sh /Users/jesse/Documents/GitHub/superpowers/superpowers/tests/explicit-skill-requests/prompts/*.txt; do
      echo "=== $f ==="
      head -30 "$f"
    done

- Step 2: Dispatch subagent for run-claude-describes-sdd.sh only

Use the subagent prompt template above with:
- Bash test content: tests/explicit-skill-requests/run-claude-describes-sdd.sh
- Drill scenario: evals/scenarios/mid-conversation-skill-invocation.yaml

- Step 3: Act on subagent verdict

If SAFE TO DELETE:

    cd /Users/jesse/Documents/GitHub/superpowers/superpowers
    git rm tests/explicit-skill-requests/run-claude-describes-sdd.sh
    git commit -m "tests: remove run-claude-describes-sdd.sh (covered by drill mid-conversation-skill-invocation)

    Subagent verification: every assertion matches a drill check.
    Other tests in tests/explicit-skill-requests/ are preserved
    (run-haiku-test.sh, run-*-multiturn-test.sh, please-use-brainstorming
    and use-systematic-debugging prompts have no drill coverage)."

If KEEP: skip the deletion and document the gap as a future drill-scenario authoring task.
Task 10c: subagent-driven-dev real-project tests
Files:
- Inspect: tests/subagent-driven-dev/go-fractals/, tests/subagent-driven-dev/svelte-todo/
- Candidate scenarios: evals/scenarios/sdd-go-fractals.yaml, evals/scenarios/sdd-svelte-todo.yaml

These are entire fixture directories with design.md, plan.md, and scaffold.sh. Each fixture directory was lifted into drill as a fixture under evals/fixtures/.

- Step 1: Confirm drill has fixture parity

    ls /Users/jesse/Documents/GitHub/superpowers/superpowers/evals/fixtures/sdd-go-fractals/
    ls /Users/jesse/Documents/GitHub/superpowers/superpowers/evals/fixtures/sdd-svelte-todo/

Expected: each contains design.md, plan.md, scaffold.sh (or equivalent) matching the source under tests/subagent-driven-dev/.

- Step 2: Dispatch subagent for each pair

Subagent prompt: same template, with the bash "test" being the directory's scaffold.sh and (if present) any *.sh runner, and the drill scenario being the corresponding sdd-*.yaml.

- Step 3: Act on verdicts

For each that returns SAFE TO DELETE:

    cd /Users/jesse/Documents/GitHub/superpowers/superpowers
    git rm -r tests/subagent-driven-dev/go-fractals/  # or svelte-todo
    git commit -m "tests: remove subagent-driven-dev/<fixture> (covered by drill sdd-<fixture>)

    Subagent verification: drill scenario asserts test suite passes
    post-execution. Fixture content lives at evals/fixtures/sdd-<fixture>/."

If both directories are removed, also remove tests/subagent-driven-dev/ itself: git rm -r any tracked files that remain, or rmdir the directory if it is left empty (git does not track empty directories).
Task 10d: tests/claude-code/test-document-review-system.sh
Candidate scenario: evals/scenarios/spec-reviewer-catches-planted-flaws.yaml

- Step 1: Dispatch subagent

Use the subagent prompt template with the bash test content and the drill scenario YAML.

- Step 2: Act on verdict

If SAFE TO DELETE:

    cd /Users/jesse/Documents/GitHub/superpowers/superpowers
    git rm tests/claude-code/test-document-review-system.sh
    git commit -m "tests: remove test-document-review-system.sh (covered by drill spec-reviewer-catches-planted-flaws)

    Subagent verification: every assertion matches a drill check."

Task 10e: tests/claude-code/test-requesting-code-review.sh
Candidate scenario: evals/scenarios/code-review-catches-planted-bugs.yaml

- Step 1: Dispatch subagent

Use the subagent prompt template with both contents.

- Step 2: Act on verdict

If SAFE TO DELETE:

    cd /Users/jesse/Documents/GitHub/superpowers/superpowers
    git rm tests/claude-code/test-requesting-code-review.sh
    git commit -m "tests: remove test-requesting-code-review.sh (covered by drill code-review-catches-planted-bugs)

    Subagent verification: every assertion matches a drill check."

Task 10f: tests/claude-code/test-worktree-native-preference.sh
Candidate scenario: evals/scenarios/worktree-creation-under-pressure.yaml

- Step 1: Dispatch subagent

Use the subagent prompt template with both contents.

- Step 2: Act on verdict

If SAFE TO DELETE:

    cd /Users/jesse/Documents/GitHub/superpowers/superpowers
    git rm tests/claude-code/test-worktree-native-preference.sh
    git commit -m "tests: remove test-worktree-native-preference.sh (covered by drill worktree-creation-under-pressure)

    Subagent verification: every assertion matches a drill check."

Task 10g: tests/claude-code/test-subagent-driven-development-integration.sh
Candidate scenario: evals/scenarios/sdd-rejects-extra-features.yaml (partial)

The spec marks this as "almost certainly keep + extend drill scenario". Don't delete. Instead:

- Step 1: Dispatch subagent for the comparison anyway

This documents the gap explicitly.

- Step 2: Decide based on subagent output

Likely outcome: KEEP with documented gap. The bash test asserts that commit_count >= 3, that npm test passes, and that analyze-token-usage.py runs. The drill scenario asserts forbidden-exports + reviewer-as-gate. These are mostly disjoint.

- Step 3: Document the gap (if KEEP)

Add a comment at the top of tests/claude-code/test-subagent-driven-development-integration.sh:

    # Drill coverage: sdd-rejects-extra-features.yaml covers the YAGNI
    # enforcement (forbidden exports + reviewer-as-gate). This bash test
    # additionally asserts: ≥3 task commits, npm test passes, token
    # analysis runs. Keep until those assertions are added to drill or
    # explicitly retired.

Then commit:

    cd /Users/jesse/Documents/GitHub/superpowers/superpowers
    git add tests/claude-code/test-subagent-driven-development-integration.sh
    git commit -m "tests: annotate SDD integration test with drill coverage notes

    Drill scenario sdd-rejects-extra-features covers the YAGNI subset.
    This bash test adds: ≥3 commits, npm test, token analysis. Kept
    until drill scenario covers those or they're retired."

Task 10h: tests/claude-code/test-subagent-driven-development.sh
This is a meta/describe-skill test (per spec). No drill scenario covers describe-skill behavior.

- Step 1: Confirm by reading the file

    cat /Users/jesse/Documents/GitHub/superpowers/superpowers/tests/claude-code/test-subagent-driven-development.sh

Expected: tests asking the agent to describe SDD skills, not exercise them.

- Step 2: KEEP and annotate

Add at the top:

    # No drill coverage: this test asks the agent to *describe* SDD
    # (asserts that asked-about skills can be summarized correctly).
    # Drill scenarios test behavior, not description. Kept.

Then commit:

    cd /Users/jesse/Documents/GitHub/superpowers/superpowers
    git add tests/claude-code/test-subagent-driven-development.sh
    git commit -m "tests: annotate SDD describe-skill test with kept-by-design note

    Tests agent's ability to *describe* the SDD skill; drill scenarios
    test behavior, not description. No drill coverage; kept by design."

Task 11: Stale-reference scrub
Files:
- Possibly modify: docs/testing.md, README.md, CLAUDE.md, lefthook.yml, .opencode/INSTALL.md, .codex-plugin/INSTALL.md, .github/*, scripts/*
- Annotate (do not rewrite): RELEASE-NOTES.md, docs/superpowers/plans/*.md

- Step 1: Build list of deleted-file paths

    cd /Users/jesse/Documents/GitHub/superpowers/superpowers
    git diff --name-only --diff-filter=D dev..HEAD | sort > /tmp/deleted-paths.txt
    cat /tmp/deleted-paths.txt

- Step 2: Search for active references

    cd /Users/jesse/Documents/GitHub/superpowers/superpowers
    while read -r path; do
      echo "=== $path ==="
      grep -rln "$path" \
        --include="*.md" \
        --include="*.yml" \
        --include="*.yaml" \
        --include="*.sh" \
        --include="*.json" \
        --exclude-dir=node_modules \
        --exclude-dir=.venv \
        --exclude-dir=evals \
        --exclude-dir=.git \
        .
    done < /tmp/deleted-paths.txt

This lists every file of the included types, outside the excluded directories, that references a deleted path. Categorize each hit:

| Hit location | Treatment |
|---|---|
| docs/testing.md | Update: actively documents the test |
| README.md (Contributing section) | Update if it points at deleted tests |
| CLAUDE.md, GEMINI.md, AGENTS.md | Update if they reference deleted tests |
| .github/workflows/*.yml | Update: CI shouldn't try to run deleted tests |
| scripts/* | Update if they run deleted tests |
| .opencode/INSTALL.md, .codex-plugin/INSTALL.md | Update if they reference deleted tests |
| lefthook.yml | Update if hooks invoke deleted tests |
| RELEASE-NOTES.md | Annotate, don't rewrite (dated artifact) |
| docs/superpowers/plans/*.md | Annotate, don't rewrite (dated artifact) |
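The search loop and the treatment table can be combined into a small report script. The following is a minimal sketch, assuming plain substring matching is sufficient; `find_references`, `treatment_for`, `SUFFIXES`, and `SKIP_DIRS` are hypothetical names that only mirror the grep flags and the table above, and per-file judgment ("update if ...") is still left to the reader.

```python
from pathlib import Path

# Hypothetical constants mirroring the grep --include / --exclude-dir flags.
SUFFIXES = {".md", ".yml", ".yaml", ".sh", ".json"}
SKIP_DIRS = {"node_modules", ".venv", "evals", ".git"}


def treatment_for(rel: str) -> str:
    """Dated artifacts are annotated; every other hit is a candidate for update."""
    if rel == "RELEASE-NOTES.md" or rel.startswith("docs/superpowers/plans/"):
        return "annotate"
    return "update"


def find_references(root: Path, deleted: list[str]) -> list[tuple[str, str, str]]:
    """Return sorted (file, deleted_path, treatment) triples for every reference found."""
    hits = []
    for path in root.rglob("*"):
        rel = path.relative_to(root)
        if not path.is_file() or path.suffix not in SUFFIXES:
            continue
        if any(part in SKIP_DIRS for part in rel.parts):
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for needle in deleted:
            if needle in text:
                hits.append((rel.as_posix(), needle, treatment_for(rel.as_posix())))
    return sorted(hits)
```

Feeding it the lines of /tmp/deleted-paths.txt produces a worklist for the next two steps, one row per file and deleted path.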
- Step 3: Update active references

For each "Update" hit, edit the file to either:
- Remove the reference if the deleted test was the only reason it was named.
- Replace it with a pointer to the drill scenario (e.g., "see evals/scenarios/triggering-test-driven-development.yaml").

- Step 4: Annotate dated artifacts

For each RELEASE-NOTES.md or docs/superpowers/plans/*.md hit, add an inline annotation at the first hit per file:

    > Note: this section references `tests/skill-triggering/run-all.sh` and
    > related bash tests that were lifted into drill scenarios on 2026-05-06
    > (see `evals/scenarios/triggering-*.yaml`). The references are
    > preserved as dated artifacts of the work this doc describes.

Don't modify the references themselves; they're historical.
- Step 5: Dispatch subagent for second-pass scrub

Dispatch a general-purpose subagent:

    Working directory: /Users/jesse/Documents/GitHub/superpowers/superpowers

    These bash test paths were deleted on the current branch; some are
    already addressed, but I want a second pair of eyes:

    <paste contents of /tmp/deleted-paths.txt>

    Search the entire superpowers tree (excluding evals/, node_modules/,
    .venv/, .git/) for any remaining references to those paths. Report
    every hit with file:line and one-sentence judgment of whether it
    needs an update or is fine as-is. Do not modify files; just report.

Address every reported hit before continuing.

- Step 6: Commit the active updates

    cd /Users/jesse/Documents/GitHub/superpowers/superpowers
    git add -u  # picks up edits to existing files
    git commit -m "docs: update references to lifted-and-deleted bash tests

    Active references in docs/testing.md, README.md, CI workflows, etc.
    now point at drill scenarios. Historical references in RELEASE-NOTES.md
    and docs/superpowers/plans/*.md are annotated as dated artifacts,
    not rewritten."

Task 12: Top-level docs
Section titled “Task 12: Top-level docs”文件:
- Modify: `docs/testing.md` (split into "Plugin tests" + "Skill behavior evals")
- Modify: `CLAUDE.md` (add evals pointer)
- Modify: `README.md` (add Contributing-section pointer)
- Modify: `.gitignore` (add `evals/results/`, `evals/.venv/`, `evals/.env`)

- [ ] **Step 1: Split docs/testing.md**
The file is currently Claude-Code-centric. Split it into two top-level sections.

Open `/Users/jesse/Documents/GitHub/superpowers/superpowers/docs/testing.md` and replace the file content with this structure (preserve the existing plugin-test details where applicable):

````markdown
# Testing Superpowers

Superpowers has two distinct kinds of tests, each in its own directory:

- **`tests/`** — does the plugin's non-LLM code work? Bash + node + python integration tests for brainstorm-server JS, OpenCode plugin loading, codex-plugin sync, and analysis utilities.
- **`evals/`** — do agents behave correctly on real LLM sessions? Python harness driving real tmux sessions of Claude Code / Codex / Gemini CLI / Copilot CLI, with an LLM actor and verifier judging skill compliance.

## Plugin tests

Live in `tests/`. Currently:

- `tests/brainstorm-server/` — node test suite for the brainstorm server JS code.
- `tests/opencode/` — bash tests for OpenCode plugin loading, bootstrap caching, and tool registration.
- `tests/codex-plugin-sync/` — bash sync verification.
- `tests/claude-code/test-helpers.sh`, `analyze-token-usage.py` — utilities used by remaining bash tests.
- `tests/claude-code/test-subagent-driven-development.sh` — agent-can-describe-SDD test (no drill counterpart).
- `tests/claude-code/test-subagent-driven-development-integration.sh` — extended SDD integration with token analysis (drill covers the YAGNI subset).
- `tests/explicit-skill-requests/` — Haiku-specific, multi-turn, and skill-name-prompted tests not covered by drill.

Run plugin tests via the relevant directory's `run-*.sh` or `npm test`.

## Skill behavior evals

Live in `evals/`. Drill is the harness; scenarios live at `evals/scenarios/*.yaml`. See `evals/README.md` for setup. Quick start:

```bash
cd evals
uv sync
export ANTHROPIC_API_KEY=sk-...
uv run drill run triggering-test-driven-development -b claude
```

Drill scenarios are slow (3-30+ minutes each) and run real LLM sessions. They are not part of CI today; the natural follow-up is a tiered model (fast subset on PR, full sweep nightly + on-demand).
````
- [ ] **Step 2: Update CLAUDE.md**

Read the current `CLAUDE.md`, find a spot near the project structure section, and add:

```markdown
## Eval harness

Skill-behavior evals live at `evals/` — see `evals/README.md`. Drill (the harness) drives real tmux sessions of Claude Code / Codex / Gemini CLI / Copilot CLI and judges skill compliance with an LLM verifier. Plugin-infrastructure tests still live at `tests/`.
```

- [ ] **Step 3: Update README.md**
Find the Contributing section. Add a line:

```markdown
- Skill-behavior tests use the eval harness at `evals/`. See `evals/README.md` for setup. Plugin-infrastructure tests live at `tests/` and run via the relevant `run-*.sh` or `npm test`.
```

- [ ] **Step 4: Update top-level .gitignore**
Open `/Users/jesse/Documents/GitHub/superpowers/superpowers/.gitignore` and add at the bottom:

```gitignore
# Eval harness — drill ships its own gitignore at evals/.gitignore;
# these are belt-and-suspenders entries for tools that don't recurse.
evals/results/
evals/.venv/
evals/.env
```

- [ ] **Step 5: Commit**
```bash
cd /Users/jesse/Documents/GitHub/superpowers/superpowers
git add docs/testing.md CLAUDE.md README.md .gitignore
git commit -m "docs: introduce evals/ as the canonical skill-behavior eval harness

- docs/testing.md split into Plugin tests + Skill behavior evals
- CLAUDE.md adds Eval harness section pointing at evals/
- README.md Contributing section mentions evals/ alongside tests/
- .gitignore adds evals/{results,.venv,.env} as belt-and-suspenders
  (evals/.gitignore covers these locally; root-level entries help
  tooling that does not recurse into nested ignore files)."
```

Task 13: Re-run smoke checks (regression gate)

**Files:** none (validation only)
- [ ] **Step 1: Run drill's pytest**

```bash
cd /Users/jesse/Documents/GitHub/superpowers/superpowers/evals
unset SUPERPOWERS_ROOT
uv run pytest 2>&1 | tail -5
```

Expected: all tests pass.
- [ ] **Step 2: Run a cheap drill scenario**

```bash
set -a
source /Users/jesse/Documents/GitHub/prime-radiant-inc/sprout/.env
set +a
cd /Users/jesse/Documents/GitHub/superpowers/superpowers/evals
unset SUPERPOWERS_ROOT
uv run drill run triggering-test-driven-development -b claude 2>&1 | tail -3
```

Expected: `claude: 1 passed, 0 failed, 0 errors`. If it fails, the docs / scrub / deletion phase broke something; bisect over the recent commits.
- [ ] **Step 3: Run the remaining plugin tests that survived**

```bash
cd /Users/jesse/Documents/GitHub/superpowers/superpowers/tests/brainstorm-server
node server.test.js 2>&1 | tail -3
```

Expected: `Results: 25 passed, 0 failed`.
Task 14: Final adversarial review

**Files:** none (review only; subagent dispatches)
- [ ] **Step 1: Build the diff for reviewers**

```bash
cd /Users/jesse/Documents/GitHub/superpowers/superpowers
git log --oneline dev..HEAD
git diff dev..HEAD --stat
```

Capture both outputs to share with reviewers.
- [ ] **Step 2: Dispatch two parallel subagents**

Use the Agent tool with two parallel calls. Send the same prompt to both, with adversarial framing:

```
Adversarial review competition: 5 points to whoever finds the most
legitimate issues. You're competing against a parallel reviewer
assigned the identical task.

**Branch:** f/evals-lift, in /Users/jesse/Documents/GitHub/superpowers/superpowers
**Base:** dev (currently b4363df)
**Spec:** docs/superpowers/specs/2026-05-06-lift-drill-into-evals-design.md

This branch lifts the obra/drill repo into superpowers/evals/ and
deletes redundant bash tests that drill scenarios cover. Two prior
adversarial reviews caught issues at the spec stage; this is the
post-implementation review.

Run: git log --oneline dev..HEAD; git diff dev..HEAD --stat

Look hard at:
1. Did the rsync-with-excludes actually exclude what it claimed?
   (find evals -name '.git' -type d should return nothing)
2. Does the lift commit message point at a real commit in obra/drill?
3. Does the SUPERPOWERS_ROOT helper actually default correctly when
   the env var is unset? (cd evals && unset SUPERPOWERS_ROOT &&
   uv run drill list — does it work?)
4. For each deleted bash test, does the corresponding drill scenario
   actually verify what the bash test asserted? Spot-check by reading
   the scenario YAML.
5. Are there active references in docs/, .github/, scripts/,
   lefthook.yml that still point at deleted bash test paths?
6. Did the drill pytest suite get updated for the new env-var
   contract, and does it pass?
7. Did the smoke scenario actually get run after path changes?
8. Is the drill repo unchanged? (cd ../drill && git status)

Verify before claiming. If you assert "X is broken", check on disk
first. Confidently-wrong claims count negatively.

Report format: numbered list, each with severity (critical/important/
minor/nitpick) and one-sentence explanation with file:line. Lead with
most serious. Cap at ~600 words.
```

- [ ] **Step 3: Address findings**
For each legitimate finding from either reviewer, fix it in a separate commit. Re-run the smoke checks (Task 13) after fixes.

- [ ] **Step 4: Declare a winner**
Per the cross-platform PR pattern, count legitimate findings (false positives count negatively). Acknowledge the winner in your reply summary.
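Review item 1 (did the rsync excludes hold?) can be spot-checked for the whole exclude list, not only `.git`. The following is a rough sketch under two stated assumptions: the exclude list is the one quoted in this plan's PR description, and each pattern is matched against basenames at any depth, which may differ from how the actual rsync invocation anchored them. `leaked_excludes` is a hypothetical name.

```python
import fnmatch
import os

# Exclude patterns as listed in this plan's PR description (assumed complete).
EXCLUDE_PATTERNS = [
    ".git", ".venv", "results", ".env",
    "__pycache__", "*.egg-info", ".private-journal",
]


def leaked_excludes(root, patterns=EXCLUDE_PATTERNS):
    """Return relative paths under `root` whose basename matches an exclude pattern."""
    leaked = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            if any(fnmatch.fnmatch(name, pattern) for pattern in patterns):
                leaked.append(os.path.relpath(os.path.join(dirpath, name), root))
    return sorted(leaked)
```

Run it against a fresh checkout of the branch. Once `uv sync` or a drill run has happened locally, `.venv/` and `results/` will exist on disk for legitimate reasons, so a hit there does not by itself mean the copy leaked them.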
Task 15: Push and open PR

**Files:** none
- [ ] **Step 1: Push the branch**

```bash
cd /Users/jesse/Documents/GitHub/superpowers/superpowers
git push -u origin f/evals-lift
```

- [ ] **Step 2: Open PR against dev with full description**
```bash
gh pr create \
  --base dev \
  --head f/evals-lift \
  --reviewer arittr \
  --title "Lift drill into superpowers as evals/ harness" \
  --body "$(cat <<'EOF'
## What problem are you trying to solve?

Drill — the standalone Python skill-compliance benchmark at obra/drill — is already the de facto eval harness for superpowers. The PRI-1397 commit series lifted ~22 bash tests into drill scenarios, and the most recent superpowers commit (a2292c5) explicitly removed a redundant bash test with the message "replaced by drill behavioral coverage". Drill is a sibling repo today, requiring contributors to clone two checkouts and set SUPERPOWERS_ROOT manually. This PR completes the migration: drill becomes superpowers/evals/.

## What does this PR change?

- Lifts the obra/drill repo into superpowers as `evals/`, with explicit rsync excludes (.git, .venv, results, .env, __pycache__, *.egg-info, .private-journal). The lift commit records the source SHA.
- Adds a `_set_superpowers_root_default()` helper to drill/cli.py so SUPERPOWERS_ROOT defaults to the parent of evals/ — no manual env-var setup.
- Drops SUPERPOWERS_ROOT from required_env in codex.yaml/gemini.yaml (the helper supplies it). Claude*.yaml keep it because they interpolate ${SUPERPOWERS_ROOT} into --plugin-dir args.
- Deletes redundant bash tests under tests/skill-triggering/, tests/explicit-skill-requests/, tests/subagent-driven-dev/, and tests/claude-code/ — gated per-file by a subagent that compared each bash test's assertions to its drill scenario's verify block. Anything not 100% covered was kept.
- docs/testing.md split into Plugin tests + Skill behavior evals.
- README.md Contributing and CLAUDE.md gain pointers to evals/.

## Is this change appropriate for the core library?

Yes. Cross-runtime evaluation is core to superpowers, the migration to drill scenarios was already underway in this repo, and the eval harness needs to be discoverable in-tree to be findable.

## What alternatives did you consider?

- Vendored copy + sync script (drill repo continues independently). Rejected: divergence risk; single-source-of-truth wins.
- git subtree merge (preserves drill history in-tree). Rejected: superpowers' git history grows by 50+ commits, the merge commit is ugly, subtrees are operationally heavy.
- Keep drill as a sibling repo and just polish docs. Rejected: doesn't solve the discoverability problem.

## Does this PR contain multiple unrelated changes?

No — every change supports "drill is now evals/ inside superpowers". Multiple commits for atomicity (verbatim copy, env helper, YAML updates, docs) but one direction.

## Existing PRs

- [x] I have reviewed all open AND closed PRs for duplicates or prior art
- Related PRs: #1486 (obra/superpowers cross-platform PR — independent; no shared file changes besides README, which has no overlap)

## Environment tested

| Harness | Version | Model | Model ID |
|---------|---------|-------|----------|
| Claude Code | local install | Opus | claude-opus-4-7 (1M context) |

Drill's own pytest suite passes from the new location. `triggering-test-driven-development` drill scenario passes from `evals/` after the path-default changes. (Larger drill sweep deferred to release-cadence runs per the spec's deferred-CI policy.)

## Evaluation

- Initial prompt: see linked spec (`docs/superpowers/specs/2026-05-06-lift-drill-into-evals-design.md`).
- Drill's own pytest suite passes.
- One drill scenario re-run from the new location end-to-end (proves the SUPERPOWERS_ROOT default works).
- Per-deleted-file subagent verification recorded in each deletion commit's message.

## Rigor

- [x] If this is a skills change: this is not a skills change; it's a tooling/infrastructure migration. No behavior-shaping content modified.
- [x] Adversarial pressure-tested: two parallel reviewers on the spec; final adversarial pre-PR review on the implementation; spec already corrected for findings before implementation began.
- [x] Did not modify carefully-tuned content.

## Human review

- [x] A human has reviewed the COMPLETE proposed diff before submission

## Action items after merge

1. Archive obra/drill on GitHub (mark read-only, add README pointer to obra/superpowers/evals/).
2. The spec lists CI integration, scenario co-location with skills, and Python package rename as deferred work. Open issues for any of these you want tracked.
EOF
)"
```

- [ ] **Step 3: Confirm PR opened**
```bash
gh pr view --web
```

Expected: browser opens to the new PR. Take a screenshot or note the URL for follow-up.
Verification checklist (run after Task 15)

- [ ] `git log --oneline dev..HEAD` shows the expected commits in order
- [ ] The lift commit message records the source SHA
- [ ] `find evals -name '.git' -type d` returns no output
- [ ] `cd evals && unset SUPERPOWERS_ROOT && uv run pytest` passes
- [ ] `cd evals && unset SUPERPOWERS_ROOT && uv run drill list` returns scenarios
- [ ] `cd evals && unset SUPERPOWERS_ROOT && uv run drill run triggering-test-driven-development -b claude` passes
- [ ] `tests/brainstorm-server/server.test.js` still passes (regression gate for non-LLM tests)
- [ ] `git diff dev..HEAD docs/superpowers/plans/2026-04-06-worktree-rototill.md docs/superpowers/plans/2026-03-23-codex-app-compatibility.md RELEASE-NOTES.md` shows annotations only, no path rewrites
- [ ] `cd ../drill && git log --oneline -1` shows obra/drill is unchanged from the source SHA recorded in the lift commit
- [ ] PR body lists the post-merge archival action item
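Several checklist items unset `SUPERPOWERS_ROOT` to exercise the default that the plan adds to `drill/cli.py`. As a way of reasoning about what those checks prove, here is a minimal sketch of how such a default could behave. It is not the actual helper: `default_superpowers_root` is a hypothetical name, and the sketch assumes the CLI module sits at `<repo>/evals/drill/cli.py`, so the repository root is two directories above the module's own directory.

```python
import os
from pathlib import Path


def default_superpowers_root(cli_file, env=None):
    """Set SUPERPOWERS_ROOT to the parent of evals/ unless it is already set.

    Assumes `cli_file` is <repo>/evals/drill/cli.py, so parents[2] is <repo>.
    Returns the value that ends up in the environment mapping.
    """
    env = os.environ if env is None else env
    repo_root = Path(cli_file).resolve().parents[2]
    # setdefault leaves an explicit value from the caller untouched.
    return env.setdefault("SUPERPOWERS_ROOT", str(repo_root))
```

Under that assumption, calling it with `__file__` from inside the CLI module would make `uv run drill list` work from `evals/` with no manual setup. An explicitly exported `SUPERPOWERS_ROOT` would still win, which is why the checks above unset the variable first.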