"Build succeeded"
Tool-using agents can misread logs or skip output. The claim sounds plausible, but it might not match what the tool actually returned.
Proofs that check. Traces that bind.
AI scientist systems pair an LLM with tools for search, code execution, evaluation, and write-ups. The reliability bottleneck is procedural: keeping a faithful link between plans, tool outputs, and conclusions over long horizons.
Nala is built for settings where progress must be checkable. Proofs are the clearest case, but the same discipline applies to any tool-using agent.
The agent claims it ran a tool or closed a proof, but the output span does not support the claim.
Long-horizon tasks accumulate stale assumptions. Agents repeat work, skip constraints, or keep acting on falsified beliefs.
Without hard verifiers, convincing prose can masquerade as progress. In math, the verifier must be the arbiter.
If you cannot replay what the agent did, you cannot debug it, trust it, or hand it off as a scientific record.
"In AI scientist agents, the most dangerous failure is not a wrong answer. It is claiming work was done when the trace says otherwise."
Nala is an LLM-agnostic wrapper for scientific workflows that must be verifiable. You bring the model. The wrapper enforces a strict loop with verifier-gated progress and trace-budget audits.
Lean4 decides whether a proof checks. Strawberry decides whether the agent’s narrative is supported by the tool trace. Together they prevent progress theater.
"Bring your own LLM. Keep the lab notebook."
Nala is an AI scientist wrapper for any LLM. It makes the workflow checkable by construction: verifiers gate progress, spans bind claims to evidence, and audits prevent plan or action drift from becoming a convincing story.
See the loop →
Any LLM can drive the workflow. The wrapper enforces what must be true before the next step can run.
State the next intended claim and the evidence spans that will justify it before any tool executes.
Run lake build, query Lean, or run an experiment. Tool output is the only source of truth.
Capture tool outputs as spans. Future claims must cite spans, not memory or paraphrase.
Strawberry audits the check-in. If a claim exceeds evidence, the system rejects it and the plan must be rebuilt.
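A minimal sketch of this loop in Python. Every name here (Span, State, run_tool, check_in) is illustrative, not Nala's actual API:

    from dataclasses import dataclass, field

    @dataclass(frozen=True)        # frozen: a span cannot be edited after capture
    class Span:
        span_id: int
        tool: str
        output: str                # verbatim tool output, never a paraphrase

    @dataclass
    class State:
        spans: list = field(default_factory=list)
        blocked: bool = False      # True: no tool may run until a passing check-in

    def run_tool(state: State, tool: str) -> Span:
        if state.blocked:
            raise RuntimeError("last check-in failed; rebuild the microplan first")
        output = "warning: declaration uses 'sorry'"   # stand-in for a real run
        span = Span(len(state.spans), tool, output)
        state.spans.append(span)
        return span

    def check_in(state: State, claim: str, cited: list) -> None:
        evidence = " ".join(state.spans[i].output for i in cited)
        # Toy audit: a "no sorry" claim with 'sorry' in the cited trace fails.
        supported = not ("no sorry" in claim and "sorry" in evidence)
        if not cited or not supported:
            state.blocked = True   # the gate closes until the plan is rebuilt

    state = State()
    span = run_tool(state, "lake build")       # execute: output becomes a span
    check_in(state, "no sorry remains", [span.span_id])
    print(state.blocked)                       # True: the claim exceeded evidence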
Three enforcement layers keep any LLM honest: verifier-gated progress, trace-budget audits, and a wrapper that blocks further actions until the check-in is consistent with evidence.
Lean4 is the source of truth. If the proof does not check, the system cannot report success.
Strawberry audits claim confidence against cited spans. Overconfident check-ins are rejected.
Nala enforces microplan, execute, and audit. No further tool runs until the check-in passes.
Lean4 is a proof assistant. You write definitions and theorems in a precise language, and Lean checks that each proof step is valid.
If you are new to proof assistants, the key idea is simple: a proof is either accepted by the compiler, or it is not. There is no "mostly proved" state that can be passed off as success.
This matters because it makes "progress" unambiguous: either the proof checks, or it does not. Lean even has an explicit placeholder, sorry, that says "this part is not proved yet".
Not "seems right". Not "passed tests". Lean checks the proof in a small trusted kernel.
sorry lets you keep working, but it is explicitly not a proof. Nala treats it as unfinished work, not success.
theorem gap_cert_169_even : ... := by
  -- work in progress
  sorry
Lean will not accept this as a finished proof. Nala should not either.
Large language models (LLMs) often fail in a specific way: they "know" the right answer is somewhere in context, but they do not actually use the evidence and instead confabulate. Strawberry flags claims whose confidence is not supported by the evidence provided.
Strawberry is not asking the model to be polite or cautious. It audits the relationship between a claim and the cited spans: if the model did not extract enough information from its evidence to justify its confidence, the claim is rejected.
In other words: Strawberry is a detector for "knows but doesn't use". When an agent says it ran a tool or closed a proof, Strawberry forces the claim to line up with a verbatim trace.
Nala stores tool outputs as immutable "spans" and requires future claims to cite them. This forces a concrete evidence trail.
The agent's interpretation ("proof closed") is audited against spans. Overconfident claims are rejected.
"The proof compiles. No sorry remains."
[lake build output]
warning: declaration uses 'sorry'
...
FLAGGED
Reason: claim exceeds evidence
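A toy version of the check that produced that verdict. The span, threshold, and regex are illustrative stand-ins; the real auditor is a model, not pattern matching:

    import re

    # Immutable record of verbatim tool output (span id is illustrative).
    SPANS = {7: "[lake build output] warning: declaration uses 'sorry'"}

    def audit(claim: str, confidence: float, cited: list) -> str:
        evidence = "\n".join(SPANS[i] for i in cited)
        closed = re.search(r"no sorry|proof (is )?closed|compiles", claim.lower())
        # High confidence that the proof closed, while the cited trace still
        # shows 'sorry', is the "knows but doesn't use" failure in miniature.
        if closed and "sorry" in evidence and confidence > 0.5:
            return "FLAGGED: claim exceeds evidence"
        return "OK"

    print(audit("The proof compiles. No sorry remains.", 0.95, [7]))
    # -> FLAGGED: claim exceeds evidence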
This is the point: Nala prevents "progress theater" by requiring the agent to check in with real outputs.
"If Lean says sorry remains, the agent cannot claim otherwise.
Nala is a tool wrapper (for example, via Model Context Protocol) that gates lake build and lean query behind microplans and evidence-backed check-ins. Any LLM can drive the loop, but no LLM can skip the evidence discipline.
Nala requires a short plan that names the next claim and cites evidence that will justify it.
Tool runs are the only source of truth: build output, queries, errors, and warnings.
Outputs are captured as immutable spans. Claims must cite those spans, not memory.
After every execution, Nala requires an audited check-in before any further tools can run. The next step must be rebuilt from what actually happened.
In real tool-using agents, the most dangerous failures are procedural: "I ran the build", "the lemma exists", "the proof closed". Nala is designed to make those failures impossible to paper over.
If the proof is unfinished, Nala says so. If a build fails, Nala cannot claim success.
Every tool run becomes a span. The full audit trail is persisted in state.json.
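A sketch of that persistence, assuming a hypothetical schema; the page does not document the real state.json layout:

    import json

    # Hypothetical state.json layout: spans plus the latest audited check-in.
    state = {
        "blocked": True,
        "spans": [
            {"span_id": 0, "tool": "lake build",
             "output": "warning: declaration uses 'sorry'"},
        ],
        "last_check_in": {"claim": "no sorry remains",
                          "cited": [0], "verdict": "rejected"},
    }

    with open("state.json", "w") as f:
        json.dump(state, f, indent=2)   # the audit trail survives the session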
Lean4 is just one target. The same "plan → execute → audit" loop applies to any agent that uses tools.
Using Nala, a Claude Opus 4 agent produced Lean-checked certificates. Crucially, it could not misreport its progress when proofs were incomplete.
The Erdos-Straus question asks whether every n > 1 admits positive integers x, y, z such that 4/n = 1/x + 1/y + 1/z. A gap certificate is a Lean theorem that covers an entire residue class:
theorem gap_cert_97 (k : Nat) : ∃ x y z, ES (420 * k + 97) x y z := by ...
Read this as: for every k, Lean can construct witnesses x, y, z and verify the equation. One theorem certifies an infinite family of inputs.
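The ES predicate is not spelled out on this page; one plausible Lean rendering, with denominators cleared so the statement is pure natural-number arithmetic, would be:

    -- Hypothetical definition of ES: 4/n = 1/x + 1/y + 1/z with denominators
    -- cleared, plus positivity of the witnesses. The real definition may differ.
    def ES (n x y z : Nat) : Prop :=
      0 < x ∧ 0 < y ∧ 0 < z ∧ 4 * (x * y * z) = n * (y * z + x * z + x * y)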
The Erdos-Straus question is a long-standing conjecture in number theory. Many cases are handled by modular certificates (proofs that work for entire residue classes). The remaining gap residues mod 420 (i.e., remainders on division by 420) are the hard cases not covered by standard certificates.
These results do not claim a full solution to Erdos-Straus. They are Lean-checked certificates for specific infinite families of inputs.
Lean proofs for: r in {73, 97, 193, 241, 313, 337, 409}
Odd-k branches proven; even-k remains open for: r in {1, 121, 169, 289, 361}
That is: one Lean theorem covers the k = 2m+1 family, while the k = 2m family still contains sorry.
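Illustrative shape of that split; theorem names mirror the page's examples, and the odd-branch proof body is elided:

    -- Odd-k branch, proven: one theorem covers every k = 2 * m + 1.
    theorem gap_cert_169_odd (m : Nat) :
        ∃ x y z, ES (420 * (2 * m + 1) + 169) x y z := by ...

    -- Even-k branch, open: the placeholder is explicit, and Nala reports it.
    theorem gap_cert_169_even (m : Nat) :
        ∃ x y z, ES (420 * (2 * m) + 169) x y z := by
      sorry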
For r = 409, splitting by m mod 11 closes the even-k branch without manual guidance.
Finite enumeration of certificate parameters appears unstable: solutions exist in every bounded search window, but they do not stabilize into a finite menu of certificates.
The recommended route forward is to pivot from explicit witness enumeration to existence-of-divisor arguments via Chinese Remainder Theorem (CRT)-style reasoning.
The remaining 5 gap residues require proofs for the even-k cases, not more partial enumerations.
Searches that only work on bounded parameter ranges are not treated as a proof strategy.
Prove the right divisors exist in the right congruence classes, then let Lean check the algebra end-to-end.
As the arguments get harder, the evidence discipline matters more: spans stay immutable and check-ins stay audited.
Want proof automation that is honest by construction?