HINT: Open a separate LLM window and supply this document to that LLM with a prompt:
Role: Act as my Senior Reviewer following and enforcing the attached PHENIX/cctbx LLM Development Policy. I am using a separate LLM for coding; your job is to audit and guide me.
Protocol:
Initialize: Ask me for the Phase 1 details: bug description, affected files, and whether an ARCHITECTURE.md exists.
Plan Audit: Critique my coding plan for flaws, the 3-file limit, and test coverage before I start.
Verification: After each code change, remind me to run ast.parse() and relevant PHENIX validation tools (e.g., phenix.molprobity).
Audit: Ensure no placeholder comments are used and conduct the "What Did We Miss?" check at the end.
Acknowledge and ask the Phase 1 questions to begin.
Then the second LLM will guide you through this whole process.
This document serves as both the official CCTBX LLM Developer Policy and the implementation guide.
This document describes the workflow that PHENIX/cctbx developers follow when using LLMs to write or modify code in the PHENIX/cctbx ecosystem. The goal is not to restrict LLM use — it is to ensure that LLM-generated code meets the same standards as hand-written code: correct, tested, portable, and maintainable.
The quick-start below gets you going immediately. Part 1 covers general principles that apply to any LLM and any codebase, with the project-specific prompts shown in context where each principle is introduced. Part 2 covers the project workflow: rules of engagement, which guideline files to attach and when, and checklists for the human side of the process. The appendices contain a file reference, the full text of the opening prompt, a prompt reference card, and a pull request checklist.
All prompt files live in libtbx/langchain/docs/prompts/. The small ones are reproduced in full throughout this document so you can read the guide without opening anything else.
If you want to start using these files right now, here is the recipe. Everything referenced here is explained in detail in Parts 1 and 2.
For what each file contains, see the File Reference table at the end of this document.
An LLM is not a compiler, not a search engine, and not a junior developer. The closest analogy is a well-read colleague who has seen a vast amount of code but has never run any of it. This has practical consequences:
Your job is to be the verifier and navigator. The LLM generates; you validate. This division of labor is the foundation of everything else in this document.
A special note on scientific code. LLMs have no understanding of physical correctness. They will write code that compiles and passes naive tests but is scientifically wrong — wrong units, stale caches, numerically unstable formulas. In crystallographic code, be especially alert to: unit confusion (distances in Å vs fractional coordinates, R-factors as fractions vs percentages, B-factors in Ų), spatial caches that must invalidate on symmetry changes, and division by values that can be zero in pathological cases. The LLM will not catch these; you must. See §1.5 for the full verification checklist.
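To make this concrete, here is a minimal sketch (a hypothetical function, not taken from the codebase) of the kind of guard a reviewer should look for whenever a formula has a denominator that can be zero and values whose units or scale matter:

```python
def r_factor(f_obs, f_calc):
  """Return the R-factor as a fraction (0.0-1.0), not a percentage.

  Hypothetical illustration only: an LLM will happily divide by
  sum(f_obs) without asking whether it can be zero, and will not
  notice if a caller later treats the result as a percentage.
  """
  denominator = sum(abs(f) for f in f_obs)
  if denominator == 0:  # pathological but possible input
    return None         # caller must handle the None case explicitly
  numerator = sum(abs(o - c) for o, c in zip(f_obs, f_calc))
  return numerator / denominator  # fraction; multiply by 100 only for display
```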
LLMs have a fixed context window — the total amount of text they can see at once (the conversation so far, any attached files, and their own response). Everything the LLM needs to know must be inside that window or it effectively does not exist.
What to provide:
What to leave out:
Practical tips:
The quality of what you get is directly proportional to the clarity of what you ask for. Vague requests produce vague code.
Good request:
Add a _safe_float() helper to metric_evaluator.py that converts string or None values to float, returning None on failure. Then use it at lines 352, 392, and in calculate_improvement_rate() to coerce r_free values before arithmetic. Follow the same pattern as _safe_float() in metrics_analyzer.py.
Weak request:
Fix the type error in metric_evaluator.py.
The good request tells the LLM what to build, where to put it, what pattern to follow, and which call sites to update. The weak request forces the LLM to guess all of that.
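For orientation, the helper requested above might look roughly like this. This is a sketch only; the actual _safe_float() in metrics_analyzer.py may differ in detail:

```python
def _safe_float(value):
  """Convert a string, number, or None to float; return None on failure."""
  if value is None:
    return None
  try:
    return float(value)
  except (TypeError, ValueError):
    return None
```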
Structure for effective requests:
Never let an LLM jump straight to writing code for a non-trivial task. The cost of a bad plan executed thoroughly is much higher than the cost of planning time.
The recommended cycle:
START ──► PLAN ──► REVIEW ──► IMPLEMENT ──► VERIFY ──► WHAT DID WE MISS?
  │          │          │            │             │             │
Attach:    Paste:     Give plan    LLM writes    You run       Ask LLM:
CCTBX      PLAN_      + REVIEW     code,         tests,        any side
GUIDE,     PROMPT     .txt to a    updates       read diffs,   effects,
WORKFLOW,  + prob.    second LLM   HANDOFF       check edges   callers,
ARCH.md,   desc.                   each step                   or gaps?
code
Paste:
WORKFLOW_PROMPT
Step 1 — Plan. Ask the LLM to produce a written plan: problem description, approach, implementation steps, risks, and a test plan. The test plan is not an afterthought — it is a first-class deliverable equal in importance to the implementation steps. It must specify which test files will be created or modified, one entry per test function (name, what it exercises, what edge cases it covers), and how the tests verify correctness (specific expected values, not just “is not None”). Every implementation step must have corresponding tests identified in the plan. Missing test coverage is the most common plan deficiency — pay particular attention to it during review.
If you have an ARCHITECTURE.md (see §1.11), provide it here so the plan accounts for the system’s structure. If you don’t have one and the task is substantial, ask the LLM to draft one first — review and correct it, then use it as input to the plan. When you are ready to request the plan, use this prompt:
Please make a plan for fixing these problems. Include a full discussion of the problem, the overall approach, the details of the approach, implementation plan, and risks involved and their mitigation. write as md
(This is PLAN_PROMPT.txt.)
Step 2 — Cross-review. For important changes, give the plan to a second LLM for critique. A different model catches different blind spots. Use this prompt with the second LLM:
Gemini, you are a senior engineer tasked with **critically reviewing this plan**. Your goal is to find every potential flaw, unclear assumption, or workflow violation. Be specific and constructive; do not give generic praise.
For each step:
1. Identify issues or risks
2. Explain why it is a problem
3. Suggest improvements or alternatives
Limit to 1–2 paragraphs per step. Do not provide general compliments.
(This is REVIEW.txt. It names Gemini but works with any LLM — the important thing is a different model from the one that wrote the plan.)
Step 3 — Revise. Feed the critique back to the first LLM. Limit to 1–2 revision rounds — diminishing returns set in fast.
Step 4 — Implement. Only now does the LLM write code, following the agreed plan step by step. Each step includes both the code change and its tests — tests are written alongside the code, not after it. The 3-file rule applies: each step must modify no more than 3 files. If a change requires more, split it into multiple atomic steps with a verification pass (review + test) between them. The LLM should checkpoint after each step so that work is recoverable if interrupted (see §1.7).
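For example, a step that adds the _safe_float() helper from §1.3 would include its tests in the same step, along these lines (the file name and test conventions here are assumptions, not project requirements):

```python
# tst_safe_float.py (hypothetical file name)
from metric_evaluator import _safe_float

def tst_safe_float():
  assert _safe_float("1.5") == 1.5
  assert _safe_float(2) == 2.0
  assert _safe_float(None) is None
  assert _safe_float("not a number") is None  # failure path returns None
  print("OK")

if __name__ == "__main__":
  tst_safe_float()
```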
After each step, ask the LLM: “Are there any other fixes or additions you would like to make to this change before we move on?” This is the single most effective quality gate — it catches oversights while the context is still fresh.
Step 5 — Verify. You run the code, run the tests, and confirm correctness. The LLM cannot do this for you (see §1.5).
Step 6 — Ask “what did we miss?” Before closing out, ask the LLM to consider whether the changes have unaddressed side effects, untouched callers, missing test coverage, or downstream consequences. Do not skip this — it catches problems that verification (§1.5) does not. It is especially valuable with a second LLM that wasn’t involved in the implementation. See §1.12.
This cycle prevents the most expensive failure mode: the LLM confidently building out an approach that was wrong from the start.
This is the single most important habit to develop.
Never trust LLM output without verification.
This applies even when:
Concrete verification steps:
Project-specific checks (see §2.4 for the full checklist):
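Independent of the project checklist, one cheap mechanical check you can always run yourself is to confirm that every changed Python file still parses, using the ast.parse() check mentioned in the reviewer prompt at the top of this document. A minimal sketch, with placeholder file names:

```python
import ast
import sys

# Placeholder list: substitute the files the LLM actually changed.
changed_files = ["metric_evaluator.py", "metrics_analyzer.py"]

ok = True
for path in changed_files:
  try:
    with open(path) as f:
      ast.parse(f.read(), filename=path)
  except SyntaxError as e:
    print("Syntax error in %s: %s" % (path, e))
    ok = False
if not ok:
  sys.exit(1)
print("All changed files parse cleanly")
```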
Understanding how LLMs fail helps you catch problems before they reach your codebase.
Hallucinated APIs. The LLM invents function names, keyword arguments, or library features that don’t exist. It does this with complete confidence. Always verify that the API calls it writes actually match the real signatures. (This is especially common with PHENIX — see “Don’t invent parameters” in CCTBX_LLM_PROGRAMMING_GUIDELINES.md §3.)
Confident wrongness. When an LLM makes an error and you point it out, it will sometimes apologize and produce a new answer that is equally wrong but differently worded. If a correction round doesn’t converge, re-state the problem from scratch rather than asking the LLM to “try again.”
Sycophantic agreement. If you suggest a wrong approach, many LLMs will agree with you rather than push back. Don’t use leading questions like “shouldn’t we use X here?” if you’re genuinely unsure. Instead ask “what are the options for this?” and evaluate the answer.
Drift over long conversations. As the conversation gets longer, the LLM gradually loses track of earlier instructions and constraints. If you notice the LLM forgetting your coding standards or making mistakes it avoided earlier, it’s time to start a fresh session with a clean summary. (See §1.7.)
Unasked-for changes. LLMs sometimes “improve” code you didn’t ask them to touch — renaming variables, refactoring adjacent functions, removing comments they consider redundant. Always diff the output against the original and reject unrelated changes.
Incomplete changes. The opposite problem: the LLM changes the function but forgets to update the callers, the tests, or the documentation. Ask explicitly: “what else needs to change to keep everything consistent?”
LLM sessions are inherently fragile. They can be interrupted by context length limits, timeouts, network issues, or simply because you closed the browser tab. Plan for this.
After each meaningful step, the LLM writes a checkpoint file that captures: what was done, what remains, what decisions were made, and what the next step is. On this project we use HANDOFF.json. A blank template:
{
"current_task": "UNKNOWN",
"status": "not_started",
"relevant_files": [],
"completed_steps": [],
"remaining_steps": [],
"decisions": [],
"assumptions": [],
"open_questions": [],
"last_stable_state": null,
"next_step": "analyze task"
}
The LLM fills in and updates every field as work progresses. The critical property: a new LLM session must be able to resume using only HANDOFF.json and the code, with no other context about what happened before. The full rules for when and how to update it are in WORKFLOW.md (see §2.1).
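Before ending a session, it is worth checking that the checkpoint really is restartable. Here is a minimal sketch that verifies the fields of the template above are present and that the task description has been filled in; this script is illustrative, not part of the workflow tooling:

```python
import json

REQUIRED = ["current_task", "status", "relevant_files", "completed_steps",
            "remaining_steps", "decisions", "next_step"]

with open("HANDOFF.json") as f:
  handoff = json.load(f)

missing = [key for key in REQUIRED if key not in handoff]
stale = handoff.get("current_task") == "UNKNOWN"

if missing or stale:
  print("HANDOFF.json is not restartable (missing: %s, stale task: %s)"
        % (missing, stale))
else:
  print("HANDOFF.json looks restartable")
```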
If the LLM is not updating HANDOFF.json after each step, remind it. This is the most common workflow violation.
Shorter sessions with clear handoffs are more reliable than marathon sessions. After about 15–20 back-and-forth exchanges, consider wrapping up and starting fresh. Each new session gets a clean context window and a fresh start on the instructions you provide.
When a session dies (timeout, context limit, crash), do not attempt to resume by asking the LLM to "continue where you left off" — it has no memory of the prior session and will guess (often incorrectly). Instead, follow this exact recovery procedure:
I am resuming a previous session. Read the attached HANDOFF.json and the current state of the code to determine exactly where we left off. Do not guess; verify against the file contents.
The LLM will read HANDOFF.json, tell you where things stand, and propose the next step. Confirm before it proceeds.
Shorthand alternative. If you are confident that HANDOFF.json is current and complete, you can use the shorter CONTINUE_PROMPT.txt instead:
You were interrupted. Do NOT continue from memory.
1. Read the attached HANDOFF.json to understand the state.
2. Read the code archive to see what was actually implemented.
3. Summarize where we are and what the very next micro-step is.
If anything is unclear or missing, ask before proceeding.
(This is CONTINUE_PROMPT.txt.) Use the full verification prompt above when you have any doubt about the checkpoint's accuracy.
Always provide:
Do not rely on the LLM “remembering” anything from a prior session. Even if the LLM platform has a memory feature, treat each coding session as standalone.
Different LLMs have different strengths and weaknesses. Using more than one is genuinely useful in two situations:
Plan review. Have one LLM write a plan and a different LLM critique it. They make different kinds of mistakes, so the reviewer will often catch things the author missed. This is the most effective multi-LLM pattern. The REVIEW.txt prompt (shown in §1.4) is designed for exactly this — give it plus the plan to the second LLM.
Stuck debugging. If one LLM has gone in circles on a bug, describe the problem fresh to a different model. A different set of biases often finds a different (and sometimes correct) path.
There is no need to use multiple LLMs for routine coding tasks. The overhead of managing multiple sessions isn’t worth it for straightforward changes.
Good at:
Bad at:
Review LLM-generated code the same way you would review a pull request from a new team member who is talented but unfamiliar with your project:
If the answer to any of these is “no,” send it back for revision — just as you would with a human author.
LLMs work from the files you give them. If you hand them three source files with no explanation of how those files fit into the larger system, they will guess — and they will guess wrong. An architecture document eliminates the most damaging category of guessing: guessing about structure.
What to put in ARCHITECTURE.md:
This does not need to be long. A one-page document with a component list and a data flow description is far more useful than nothing. The goal is to prevent the LLM from making structurally wrong changes — like putting server-only logic in client code, or modifying a file without updating the three other files that must stay in sync with it.
When to provide it:
When to update it:
After the code is written and the tests pass, there is one more step before you close out: ask the LLM whether anything was missed. This is Step 6 in the Plan–Review–Implement cycle (§1.4).
This catches a class of problems that verification (§1.5) does not: side effects, downstream consequences, and unstated assumptions that neither you nor the LLM thought to check. It is cheap to do and occasionally catches expensive mistakes, so do not skip it.
How to do it. After the implementation is complete and tests pass, ask:
We just made these changes: [brief summary or diff]. Are there any side effects, callers, downstream consumers, edge cases, documentation, or tests that we haven’t addressed? Think carefully about what could break that we haven’t considered.
Why a second LLM is especially useful here. The LLM that wrote the code has a blind spot: it has already convinced itself the approach is correct. A fresh LLM (or a fresh session) has no such commitment. Give it the changed files, the original files, and the question above. It will often spot things the implementing LLM overlooked — a caller that passes None, a serialization path that expects the old field name, a test file that hard-codes an assumption the change just invalidated.
What to look for in the answer:
You don’t need to act on every suggestion — the LLM may flag things that aren’t actually problems. But reviewing the list takes a minute and occasionally saves hours.
The sections above focus on modifying existing code — fixing a bug, adding a feature, refactoring a module. But LLMs can also help build a substantial system from the ground up, provided you approach the work in the right order. The key insight is that architecture and tests come first, code comes last.
A large LLM-assisted project produces five deliverables, roughly in this order:
You may also want an OVERVIEW.md — a short, high-level description of what the system does, who it is for, and how the pieces fit together. This is useful for onboarding new team members and for giving LLMs a quick orientation at the start of a session.
Before writing any code, invest a full session (or more) in producing ARCHITECTURE.md. This is the single most important document for a large project. Without it, the LLM will make structural decisions ad hoc, and those decisions will conflict across sessions.
Start by describing the system to the LLM: what it does, who uses it, what the main operations are, and what the constraints are (performance, portability, compatibility with existing code). Then ask the LLM to draft an architecture document. Use this prompt:
Based on the requirements I’ve described, draft an ARCHITECTURE.md for this system. Include:
- Major components and their responsibilities
- Data flow between components
- Key interfaces (what each component exposes)
- Design constraints and invariants
- File/directory structure
Do not write any code. Focus on structure.
Review the draft critically. The architecture is the foundation for everything that follows — errors here propagate into every file. Have a second LLM review it (using REVIEW.txt). Revise until you are confident.
What makes a good architecture document:
With ARCHITECTURE.md in hand, implement the system one component at a time, following the Plan–Review–Implement cycle (§1.4) for each component. The critical discipline: every implementation step produces both code and tests. Do not build the whole system and then write tests — by that point the LLM has forgotten the edge cases, and you have too many untested interactions.
For each component:
Test coverage for large projects. The test suite is not a formality — it is the primary defense against the LLM introducing regressions in later sessions. As the project grows beyond what fits in a single context window, the LLM will inevitably lose track of decisions made in earlier sessions. The tests are what catch the resulting mistakes. Aim for:
Once the system is implemented and the tests pass, produce the documentation. LLMs are good at drafting documentation from working code — this is one of their strengths (§1.9). But the documentation must be reviewed as carefully as the code, because the LLM will confidently describe behavior that doesn’t match the implementation.
OVERVIEW.md — Write this first. One to two pages: what the system does, who it is for, how the major pieces fit together, and how to get started. This is the document you attach to future LLM sessions to orient the LLM quickly. It is also the document a new team member reads first.
DEVELOPER_GUIDE.md — How to work on the code. Covers: how to set up a development environment, how to run the tests, the project’s coding conventions (or a reference to the coding guidelines file), how the architecture maps to the directory structure, and any non-obvious patterns that a new developer would need to understand. Ask the LLM to draft this from the architecture document and the source code, then review it for accuracy.
USER_GUIDE.md — How to use the system. Covers: installation, configuration, typical workflows, command-line usage or API reference, and troubleshooting. This should be written from the user’s perspective, not the developer’s. If the system has a GUI, include screenshots or describe the interface. Ask the LLM to draft this from the code and the overview, then test every instruction yourself.
A large project will span many LLM sessions — potentially dozens. The advice in §1.7 applies, but with extra emphasis:
It is tempting to skip the architecture phase and start coding immediately. Don’t. The architecture document pays for itself many times over:
Similarly, it is tempting to defer tests until the end. Don’t. Tests written alongside the code catch bugs when they are cheap to fix. Tests written after the fact tend to be shallow (they test what the code does, not what it should do) and miss the edge cases that the LLM forgot about three sessions ago.
When you start a session using WORKFLOW_PROMPT.txt (the full text is in the Appendix), the LLM is instructed to follow these rules. Knowing them helps you understand why the LLM behaves the way it does — and when to correct it.
Small steps (the 3-file rule). The LLM must modify at most 3 files per implementation step. If a change requires more, it must be split into atomic steps with a verification pass between them. Work proceeds in micro-steps: understand the problem, identify files, create a plan, implement one change, validate, repeat.
No placeholders. The LLM must provide full copies of all changed files, never diffs-only or “# ... rest of code stays the same ...” comments. You should be able to drop the output directly into the codebase. If the LLM produces output containing placeholder comments, reject it immediately and ask for the complete code. This is a hard rule, not a preference — placeholder comments silently delete code when pasted. See §1.5 for details.
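To make the danger concrete, here is the kind of output to reject (an invented fragment for illustration). If you paste this over the original file, every function below the placeholder comment is silently deleted:

```python
# metric_evaluator.py as returned by the LLM -- reject this output
def _safe_float(value):
  if value is None:
    return None
  try:
    return float(value)
  except (TypeError, ValueError):
    return None

# ... rest of code stays the same ...
# (nothing below this line exists; pasting the file verbatim discards
#  every other function that was in the original module)
```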
Mandatory checkpoints. The LLM must update HANDOFF.json after every micro-step — after reasoning, after each code change, before running tests, after test results, and before ending its response. This is the rule that gets violated most often. If you notice the LLM batching multiple changes without checkpointing, remind it.
Plan before code. The LLM must write a plan (max 6–8 steps, each independently executable) and have it reviewed before writing any code. See §1.4.
Archive output. When delivering changes, the LLM produces a tar.gz archive preserving the original directory structure, plus the updated HANDOFF.json and a short explanation for each modified file.
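If you ever need to reproduce the archive step yourself (for example, to package local edits the same way the LLM is asked to), here is a minimal sketch using the standard library; the paths are placeholders:

```python
import tarfile

# Placeholder paths: the changed files plus the updated checkpoint.
files_to_deliver = [
  "path/to/changed_module.py",
  "path/to/tst_changed_module.py",
  "HANDOFF.json",
]

with tarfile.open("changes.tar.gz", "w:gz") as tar:
  for path in files_to_deliver:
    tar.add(path)  # arcname defaults to the given path, preserving structure
```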
The full workflow rules are in WORKFLOW.md, which the LLM reads at session start.
Attaching the right guideline files gives the LLM the conventions and pitfalls specific to the code it will touch. Attaching the wrong ones wastes context window space. Here is when to use each:
Always attach (every session): These are in libtbx/langchain/docs/prompts/.
Attach when working on agent code:
Don’t over-attach. If you are fixing a bug in iotbx/pdb/, you don’t need the agent guidelines. If you are modifying agent/error_classifier.py, you do. Extra guideline files cost context window space that could hold more of the actual code the LLM needs to see.
Here is the full cycle showing which prompt to use at each stage and what the human does at each step:
┌─────────────────────────────────────────────────┐
│ START SESSION │
│ │
│ Attach as files (do NOT paste): │
│ CCTBX_LLM_PROGRAMMING_GUIDELINES.md │
│ WORKFLOW.md │
│ ARCHITECTURE.md (if you have one) │
│ AI_AGENT_LLM_PROGRAMMING_GUIDELINES.md │
│ (only if working on agent code) │
│ code archive │
│ HANDOFF.json (if resuming) │
│ │
│ Paste into chat: │
│ WORKFLOW_PROMPT.txt │
└────────────────────┬────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────┐
│ PLAN │
│ Paste into chat: │
│ PLAN_PROMPT.txt + problem description │
│ Human: Read the plan. Does it make sense? │
└────────────────────┬────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────┐
│ REVIEW (recommended) │
│ In a DIFFERENT LLM session: │
│ Give the plan + REVIEW.txt │
│ Feed critique back to first LLM. │
│ Revise 1–2 rounds. │
└────────────────────┬────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────┐
│ IMPLEMENT │
│ LLM executes plan step by step. │
│ LLM updates HANDOFF.json after each step. │
│ Human: verify each step (§1.5, §2.4) │
└────────────────────┬────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────┐
│ WHAT DID WE MISS? (§1.12) │
│ Ask the LLM (or a second LLM): │
│ any side effects, callers, edge cases, │
│ or downstream consequences we missed? │
│ Review the answer. Act on real issues. │
└────────────────────┬────────────────────────────┘
│
┌────────┴────────┐
▼ ▼
┌──────────────────┐ ┌──────────────────────────┐
│ DONE │ │ INTERRUPTED │
│ HANDOFF.json │ │ New session │
│ reflects final │ │ Same file attachments │
│ state. │ │ Paste CONTINUE_PROMPT │
│ Update │ │ (see §1.7) │
│ ARCHITECTURE.md │ │ │
│ if structure │ │ │
│ changed. │ │ │
└──────────────────┘ └──────────────────────────┘
All files live in libtbx/langchain/docs/prompts/.
Attach as files — these are large reference documents. Do NOT paste their contents into the chat; that crowds out the LLM’s reasoning space and leaves less room for your actual code.
Paste into chat — these are short prompts you type or paste at specific moments during a session. They are small enough that pasting is fine.
Carried between sessions — the LLM creates and maintains this file; you save it and provide it to the next session.
This is the opening message you paste to start a session. It is the longest of the session prompts (~55 lines), so it is kept here rather than inline in Part 1.
You are working on a large codebase with strict workflow rules. Your primary goal is NOT just to complete tasks, but to maintain a continuously valid HANDOFF.json so work can resume at any moment if interrupted.

## Core Rules

1. HANDOFF.json must ALWAYS be complete and restartable
2. After EVERY meaningful step, update HANDOFF.json
3. NEVER batch large work without checkpointing
4. Before any long operation (like running tests), update HANDOFF.json
5. If you suspect time/resource limits, STOP and write HANDOFF.json immediately

## Required Workflow

At the start:
* Read HANDOFF.json (if present)
* Summarize current task and state
* Propose a step-by-step plan
* Ask for any missing files BEFORE coding

During work:
* Break work into small steps
* After each step:
  * Update HANDOFF.json
  * Ensure it reflects current truth

Before running tests:
* Fully update HANDOFF.json with:
  * changes made
  * expected outcomes
  * what success/failure means

If interrupted, HANDOFF.json must allow a new session to continue with NO additional context.

## Constraints

* Modify at most 3 files per implementation step unless justified
* Prefer minimal diffs over rewrites
* Preserve public APIs unless explicitly required

## Output Format

When you make changes:
1. Show patch/diff
2. Update HANDOFF.json
3. Brief explanation

Acknowledge these rules and begin by reading HANDOFF.json or requesting it.
A quick reference for which prompt to use at each point in the workflow.
| When | Prompt |
|---|---|
| Start | Attach guidelines + source files + test files |
| Resuming | Attach same files + HANDOFF.json. Paste: “Read HANDOFF.json and the current code to determine where we left off. Do not guess; verify against file contents.” |
| Before coding | “Create a plan as a markdown file, including a detailed test plan” |
| Plan review | “What could go wrong? Are the tests sufficient?” |
| Each step | “Write the code and its tests together. No more than 3 files.” |
| Each step | “Any other fixes before we move on?” |
| Each step | “Run all relevant tests (new and existing)” |
| Each step | Update HANDOFF.json before proceeding |
| Scientific changes | “Run phenix.model_vs_data / phenix.molprobity / phenix.mtriage on the output” |
| All done | “Run a full audit of all changes” |
| All done | “What did we miss? Any side effects, callers, or gaps?” |
| All done | “Verify every test in the plan has been implemented” |
| Final | “Check changed files for Common Pitfalls” |
Use this checklist when reviewing an LLM-assisted pull request. It reorganizes the session checklist from §2.4 for the commit-time review moment.
Planning and review
Tests
Code quality
Scientific correctness (when applicable)
Process