"Build succeeded"
Tool-using agents can misread logs or skip output. The claim sounds plausible, but it might not match what the tool actually returned.
Proofs that check. Traces that bind.
AI scientist systems pair an LLM with tools for search, code execution, evaluation, and write-ups. The reliability bottleneck is procedural: keeping a faithful link between plans, tool outputs, and conclusions over long horizons.
Nala is built for settings where progress must be checkable. Proofs are the clearest case, but the same discipline applies to any tool-using agent.
The agent claims it ran a tool or closed a proof, but the output span does not support the claim.
Long-horizon tasks accumulate stale assumptions. Agents repeat work, skip constraints, or keep acting on falsified beliefs.
Without hard verifiers, convincing prose can masquerade as progress. In math, the verifier must be the arbiter.
If you cannot replay what the agent did, you cannot debug it, trust it, or hand it off as a scientific record.
"In AI scientist agents, the most dangerous failure is not a wrong answer. It is claiming work was done when the trace says otherwise."
Nala is an LLM-agnostic wrapper for scientific workflows that must be verifiable. You bring the model. The wrapper enforces a strict loop with verifier-gated progress and trace-budget audits.
Lean4 decides whether a proof checks. Strawberry decides whether the agent’s narrative is supported by the tool trace. Together they prevent progress theater.
"Bring your own LLM. Keep the lab notebook."
Nala is an AI scientist wrapper for any LLM. It makes the workflow checkable by construction: verifiers gate progress, spans bind claims to evidence, and audits prevent plan or action drift from becoming a convincing story.
See the loop →
Any LLM can drive the workflow. The wrapper enforces what must be true before the next step can run.
State the next intended claim and the evidence spans that will justify it before any tool executes.
Run lake build, query Lean, or run an experiment. Tool output is the only source of truth.
Capture tool outputs as spans. Future claims must cite spans, not memory or paraphrase.
Strawberry audits the check-in. If a claim exceeds evidence, the system rejects it and the plan must be rebuilt.
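A minimal sketch of this loop in Python. Every name here (Span, State, run_tool, check_in) is illustrative, not Nala's actual API:

    from dataclasses import dataclass, field

    @dataclass(frozen=True)        # frozen: a span cannot be edited after capture
    class Span:
        span_id: int
        tool: str
        output: str                # verbatim tool output, never a paraphrase

    @dataclass
    class State:
        spans: list = field(default_factory=list)
        blocked: bool = False      # True: no tool may run until a passing check-in

    def run_tool(state: State, tool: str) -> Span:
        if state.blocked:
            raise RuntimeError("last check-in failed; rebuild the microplan first")
        output = "warning: declaration uses 'sorry'"   # stand-in for a real run
        span = Span(len(state.spans), tool, output)
        state.spans.append(span)
        return span

    def check_in(state: State, claim: str, cited: list) -> None:
        evidence = " ".join(state.spans[i].output for i in cited)
        # Toy audit: a "no sorry" claim with 'sorry' in the cited trace fails.
        supported = not ("no sorry" in claim and "sorry" in evidence)
        if not cited or not supported:
            state.blocked = True   # the gate closes until the plan is rebuilt

    state = State()
    span = run_tool(state, "lake build")       # execute: output becomes a span
    check_in(state, "no sorry remains", [span.span_id])
    print(state.blocked)                       # True: the claim exceeded evidence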
Three enforcement layers keep any LLM honest: verifier-gated progress, trace-budget audits, and a wrapper that blocks further actions until the check-in is consistent with evidence.
Lean4 is the source of truth. If the proof does not check, the system cannot report success.
Strawberry audits claim confidence against cited spans. Overconfident check-ins are rejected.
Nala enforces microplan, execute, and audit. No further tool runs until the check-in passes.
Lean4 is a proof assistant. You write definitions and theorems in a precise language, and Lean checks that each proof step is valid.
If you are new to proof assistants, the key idea is simple: a proof is either accepted by the compiler, or it is not. There is no "mostly proved" state that can be passed off as success.
This matters because it makes "progress" unambiguous: either the proof checks, or it does not. Lean even has an explicit placeholder, sorry, that says "this part is not proved yet".
Not "seems right". Not "passed tests". Lean checks the proof in a small trusted kernel.
sorry lets you keep working, but it is explicitly not a proof. Nala treats it as unfinished work, not success.
theorem gap_cert_169_even : ... := by
  -- work in progress
  sorry
Lean will not accept this as a finished proof. Nala should not either.
Large language models (LLMs) often fail in a specific way: they "know" the right answer is somewhere in context, but they do not actually use the evidence and instead confabulate. Strawberry flags claims whose confidence is not supported by the evidence provided.
Strawberry is not asking the model to be polite or cautious. It audits the relationship between a claim and the cited spans: if the model did not extract enough information from its evidence to justify its confidence, the claim is rejected.
In other words: Strawberry is a detector for "knows but doesn't use". When an agent says it ran a tool or closed a proof, Strawberry forces the claim to line up with a verbatim trace.
Nala stores tool outputs as immutable "spans" and requires future claims to cite them. This forces a concrete evidence trail.
The agent's interpretation ("proof closed") is audited against spans. Overconfident claims are rejected.
"The proof compiles. No sorry remains."
[lake build output]
warning: declaration uses 'sorry'
...
FLAGGED
Reason: claim exceeds evidence
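A toy version of the check that produced that verdict. The span, threshold, and regex are illustrative stand-ins; the real auditor is a model, not pattern matching:

    import re

    # Immutable record of verbatim tool output (span id is illustrative).
    SPANS = {7: "[lake build output] warning: declaration uses 'sorry'"}

    def audit(claim: str, confidence: float, cited: list) -> str:
        evidence = "\n".join(SPANS[i] for i in cited)
        closed = re.search(r"no sorry|proof (is )?closed|compiles", claim.lower())
        # High confidence that the proof closed, while the cited trace still
        # shows 'sorry', is the "knows but doesn't use" failure in miniature.
        if closed and "sorry" in evidence and confidence > 0.5:
            return "FLAGGED: claim exceeds evidence"
        return "OK"

    print(audit("The proof compiles. No sorry remains.", 0.95, [7]))
    # -> FLAGGED: claim exceeds evidence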
This is the point: Nala prevents "progress theater" by requiring the agent to check in with real outputs.
"If Lean says sorry remains, the agent cannot claim otherwise.
Nala is a tool wrapper (for example, via Model Context Protocol) that gates lake build and lean query behind microplans and evidence-backed check-ins. Any LLM can drive the loop, but no LLM can skip the evidence discipline.
Nala requires a short plan that names the next claim and cites evidence that will justify it.
Tool runs are the only source of truth: build output, queries, errors, and warnings.
Outputs are captured as immutable spans. Claims must cite those spans, not memory.
After every execution, Nala requires an audited check-in before any further tools can run. The next step must be rebuilt from what actually happened.
In real tool-using agents, the most dangerous failures are procedural: "I ran the build", "the lemma exists", "the proof closed". Nala is designed to make those failures impossible to paper over.
If the proof is unfinished, Nala says so. If a build fails, Nala cannot claim success.
Every tool run becomes a span. The full audit trail is persisted in state.json.
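A sketch of that persistence, assuming a hypothetical schema; the page does not document the real state.json layout:

    import json

    # Hypothetical state.json layout: spans plus the latest audited check-in.
    state = {
        "blocked": True,
        "spans": [
            {"span_id": 0, "tool": "lake build",
             "output": "warning: declaration uses 'sorry'"},
        ],
        "last_check_in": {"claim": "no sorry remains",
                          "cited": [0], "verdict": "rejected"},
    }

    with open("state.json", "w") as f:
        json.dump(state, f, indent=2)   # the audit trail survives the session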
Lean4 is just one target. The same "plan → execute → audit" loop applies to any agent that uses tools.
Using Nala, a Claude Opus 4 agent produced Lean-checked certificates. Crucially, it could not misreport its progress when proofs were incomplete.
The Erdos-Straus question asks whether every n > 1 admits positive integers x, y, z such that 4/n = 1/x + 1/y + 1/z. A gap certificate is a Lean theorem that covers an entire residue class:
theorem gap_cert_97 (k : Nat) : ∃ x y z, ES (420 * k + 97) x y z := by ...
Read this as: for every k, Lean can construct witnesses x, y, z and verify the equation. One theorem certifies an infinite family of inputs.
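The ES predicate is not spelled out on this page; one plausible Lean rendering, with denominators cleared so the statement is pure natural-number arithmetic, would be:

    -- Hypothetical definition of ES: 4/n = 1/x + 1/y + 1/z with denominators
    -- cleared, plus positivity of the witnesses. The real definition may differ.
    def ES (n x y z : Nat) : Prop :=
      0 < x ∧ 0 < y ∧ 0 < z ∧ 4 * (x * y * z) = n * (y * z + x * z + x * y)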
The Erdos-Straus question is a long-standing conjecture in number theory. Many cases are handled by modular certificates (proofs that work for entire residue classes). The remaining gap residues mod 420 (i.e., remainders on division by 420) are the hard cases not covered by standard certificates.
These results do not claim a full solution to Erdos-Straus. They are Lean-checked certificates for specific infinite families of inputs.
Lean proofs for: r in {73, 97, 193, 241, 313, 337, 409}
Odd-k branches proven; even-k remains open for: r in {1, 121, 169, 289, 361}
That is: one Lean theorem covers the k = 2m+1 family, while the k = 2m family still contains sorry.
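Illustrative shape of that split; theorem names mirror the page's examples, and the odd-branch proof body is elided:

    -- Odd-k branch, proven: one theorem covers every k = 2 * m + 1.
    theorem gap_cert_169_odd (m : Nat) :
        ∃ x y z, ES (420 * (2 * m + 1) + 169) x y z := by ...

    -- Even-k branch, open: the placeholder is explicit, and Nala reports it.
    theorem gap_cert_169_even (m : Nat) :
        ∃ x y z, ES (420 * (2 * m) + 169) x y z := by
      sorry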
For r = 409, splitting by m mod 11 closes the even-k branch without manual guidance.
Finite enumeration of certificate parameters appears unstable: solutions exist in every bounded search window, but they do not stabilize into a finite menu of certificates.
The recommended route forward is to pivot from explicit witness enumeration to existence-of-divisor arguments via Chinese Remainder Theorem (CRT)-style reasoning.
The remaining 5 gap residues require proofs for the even-k cases, not more partial enumerations.
Searches that only work on bounded parameter ranges are not treated as a proof strategy.
Prove the right divisors exist in the right congruence classes, then let Lean check the algebra end-to-end.
As the arguments get harder, the evidence discipline matters more: spans stay immutable and check-ins stay audited.
Want proof automation that is honest by construction?