30 May 2026 6 min read

The agent stack is mature. The science it produces still isn't.

The agent layer for biology grew up this week. CARIBOU runs omics end-to-end, SpatialClaw handles spatial transcriptomics from raw data to niche calls, agents reconstruct PK curves from sparse clinical samples, and an open-source model matches AlphaFold across a billion proteins. Read the dailies and you'd conclude the field crossed a capability threshold.

It did. But the more interesting story is what crossing it exposed. The bottleneck in agentic discovery isn't the agent anymore. It's everything the agent rests on — the memory it keeps between runs, the budget that constrains its experiments, the package supply chain it inherits, the temporal grounding of the knowledge it queries. Capability ran ahead. Trust didn't.

The pipeline question is settled

Stop asking whether agents can run a bioinformatics workflow. They can. CARIBOU chains tool selection, parameter setting, and result interpretation across bulk RNA-seq, scRNA-seq, and variant calling in one loop. SpatialClaw does the same for spatial omics. A grey-box Transformer orchestrated by agents initializes pharmacometric models from sparse PK data — the kind of task that used to require a pharmacometrician and a week. Anthropic's survey of 1,260 social scientists shows the same agent patterns quietly absorbed into adjacent research workflows. The capability is real and it's distributed.

Which means the interesting questions move one layer down. Not 'can the agent do it' but 'what does the agent need to be doing it well, repeatedly, on data that matters.' This week answered that too, by accident, by showing where the new floor cracks.

Where the floor cracks

Four cracks, all visible this week, none owned by any single paper.

Memory. SpatialClaw pairs tool-calling with a persistent memory store because spatial-omics analysis is iterative and an agent that forgets what it tried yesterday is worse than a grad student who forgets. YourMemory shipped a temporal-reasoning memory layer the same week. These aren't coincidence. Multi-step science requires state, and the field is figuring out — in public — that stateless LLM calls don't compose into a research program.

Time. ChronoMedKG time-stamps clinical knowledge because static medical KGs have been quietly poisoning clinical-LLM evaluations: a model retrieves a 2019 guideline against a 2024 case and looks right by yesterday's standard. This is a knowledge-substrate problem that every agent inherits. Most haven't noticed.

Budget. Self-improving protein design under strict oracle budgets reframes the protein-design loop around the thing that actually costs money — the wet-lab assay. Capability without budget discipline is theater. The papers that treat the oracle as scarce will outproduce the ones that don't, because at production scale they're the only ones whose runs close.

Security. A critical vulnerability in a widely-used agent package exposed millions of deployments, including bio and clinical pipelines that pulled the dependency. The moment agents move from notebooks into production, they inherit the software supply chain — and the supply chain has not been audited for what agents now do with it.

What this changes

If the pipeline is settled and the substrate isn't, the competitive question rearranges. A lab that ships an end-to-end omics agent in 2026 is not differentiated. The infrastructure is open, the patterns are known, and open-source structure prediction now covers a billion proteins with weights and training code. What's scarce is the surrounding apparatus: a memory architecture that lets an agent build on its own prior runs, a knowledge graph that knows what year it is, an oracle-budget discipline that survives contact with a real assay queue, a dependency audit that means the agent you ship next quarter isn't tomorrow's CVE.

This is also why the OpenAI geometry result matters even though it's not biology. A frontier model produced an original counterexample to an 80-year-old conjecture — original mathematics, not retrieval. The capability ceiling rose again this week. The gap between what a model can do in principle and what an agent in a lab can be trusted to do in practice widened in the same week. That gap is the work.

The question for next quarter

Stop benchmarking agents on whether they finish the pipeline. Benchmark them on what happens on run 50, after the memory has accumulated, after the oracle budget has been half-spent on dead ends, after a guideline has updated mid-study, after a dependency has been patched. The agent that wins isn't the one with the cleverest reasoning loop. It's the one whose substrate doesn't betray it. Build that, or buy time from someone who has.

Reply with what you're seeing. A human reads them. Forward freely.

AGENTIC ARC

Nº I · week of 25 May 2026 · from Agentic Discovery