Benchmarks bite back at biomedical agents

Agentic Discovery  ·  Nº Sixteen  ·  15 May 2026

Editor's Note

Today: a process-level benchmark that grades how agents think, not just what they answer, plus a router you can actually peek inside.

 

Nº 01 · The Lede  —  bioRxiv  —  Agents · Infrastructure

BiomniBench grades agent reasoning

Fig. I  bioRxiv · Filed 15 May 2026.

BiomniBench scores LLM agents on the process, not just the answer, across real-world biomedical research tasks — tool-call order, intermediate reasoning steps, and recovery from dead ends all get graded. Headline finding: agents that ace outcome-only benchmarks routinely score below 50% on process fidelity, meaning they often arrive at right answers through wrong paths. Anchors a new reference benchmark for biomedical agent claims; outcome-only scores stop being sufficient evidence the moment a competitor publishes their BiomniBench numbers.
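To make "right answer, wrong path" concrete, here is a minimal sketch of one way tool-call order could be scored. It is not BiomniBench's rubric, which this brief does not describe. The metric (longest common subsequence against a reference trajectory), the function names, and the tool names are all assumptions for illustration.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two call sequences."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]


def order_fidelity(agent_calls: List[str], reference_calls: List[str]) -> float:
    """Fraction of the reference trajectory the agent reproduced in order.

    Hypothetical proxy metric: it ignores extra calls and does not look at
    intermediate reasoning or dead-end recovery.
    """
    if not reference_calls:
        return 1.0
    return lcs_length(agent_calls, reference_calls) / len(reference_calls)


# Illustrative tool names only; not taken from the benchmark.
reference = ["fetch_sequence", "align", "annotate", "report"]
agent = ["align", "fetch_sequence", "report"]

score = order_fidelity(agent, reference)
print(f"order fidelity: {score:.2f}")
```

An agent could still report the right final answer here while matching only half of the reference path, which is the kind of gap an outcome-only score cannot see.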

Read the source →

Why it matters

Process-level evaluation becomes the new floor for biomedical agent credibility. Vendors who report only end-task accuracy now look like they are hiding something, and the debate shifts from 'does it work?' to 'does it work for defensible reasons?'

 

Nº 02  —  bioRxiv  —  Field report

OmniGene-4 opens the router

Fig. II  bioRxiv · Filed 15 May 2026.

OmniGene-4 unifies bio-language modeling in a single MoE architecture (mixture of experts: many specialist sub-networks, with a learned router choosing which ones process each input) with router-level interpretability, so you can see which expert fired on a given sequence or query. Moves bio-foundation models from opaque monoliths toward inspectable systems, narrowing the gap between performance and the auditability regulators are starting to ask for.
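For readers new to MoE routing, a minimal sketch of generic top-k gating shows what "seeing which expert fired" can look like. This is not OmniGene-4's router, whose details are not given here. The expert labels, the logits, and the function names are assumptions for illustration.

```python
import math
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over router logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits: List[float], k: int = 2) -> List[Tuple[int, float]]:
    """Return (expert_index, renormalized_weight) for the k top-scoring experts."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


# Hypothetical expert labels and router logits for a single input token.
EXPERT_NAMES = ["dna", "rna", "protein", "text"]
choices = route_top_k([2.0, 0.1, 1.5, -1.0], k=2)

# The inspectable part: log which experts were selected and with what weight.
for index, weight in choices:
    print(f"{EXPERT_NAMES[index]}: {weight:.2f}")
```

Logging the selected indices and weights per input is the simplest form of a router-level audit trail; richer interpretability would build on the same record.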

Read more →

 

Nº 03  —  arXiv  —  Field report

Speculative tool calls cut latency

Fig. III  arXiv · Filed 15 May 2026.

Async I/O and speculative tool calling let agents fire likely next tool calls before the model finishes deciding, slashing wall-clock latency for interactive loops. Collapses the responsiveness gap between batch-style research agents and real-time lab-instrument or clinical-assist surfaces — the same plumbing biomedical agents inherit whether or not they ask for it.
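The idea can be sketched with Python's asyncio: start the predicted tool call while the model is still deciding, then keep the result on a hit or cancel it on a miss. This is a toy under stated assumptions, not the paper's implementation. The tool names, delays, and helper functions are hypothetical stand-ins.

```python
import asyncio
from typing import Tuple

TOOL_DELAY = 0.2     # simulated tool latency in seconds (assumption)
DECIDE_DELAY = 0.05  # simulated time for the model to decide (assumption)


async def call_tool(name: str) -> str:
    """Stand-in for a slow, read-only tool call."""
    await asyncio.sleep(TOOL_DELAY)
    return f"{name}:result"


async def decide_next_tool() -> str:
    """Stand-in for the model finishing its decision."""
    await asyncio.sleep(DECIDE_DELAY)
    return "search"


async def speculative_step(predicted: str) -> Tuple[str, bool]:
    # Fire the predicted call before the decision is known.
    speculative = asyncio.create_task(call_tool(predicted))
    chosen = await decide_next_tool()
    if chosen == predicted:
        # Hit: the tool call overlapped with the decision time.
        return await speculative, True
    # Miss: discard the speculative work and run the tool actually chosen.
    speculative.cancel()
    try:
        await speculative
    except asyncio.CancelledError:
        pass
    return await call_tool(chosen), False


result, hit = asyncio.run(speculative_step("search"))
print(result, hit)
```

A design caveat worth stating: speculation like this is only safe for calls without side effects, since a missed guess gets thrown away after it has already started.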

Read more →

 

Also Filed  ·  Two Briefs from the queue

Nº 04  —  arXiv  —  Agents · Infrastructure

Multi-agent ED digital twin

Multi-agent systems validated against an emergency-department digital twin that simulates triage, resource allocation, and patient flow under load. Moves clinical agent orchestration from whiteboard diagrams toward deployment-viable systems, with a reproducible testbed that hospital IT can point to before signing off.

Read →

Nº 05  —  Anthropic  —  Field report

Anthropic, Gates ink $200M deal

Anthropic partnered with the Gates Foundation on a $200M push to apply Claude to global-health priorities: TB, malaria, and maternal health. Sets a new reference point for funding frontier-LLM deployment in low-resource clinical settings, where compute budgets and data scarcity have kept agents stuck in pilots.

Read →

 

· · ·

Reply with your discoveries. A human reads them. Forward freely.