Open LLMs run the lab bench

Agentic Discovery  ·  Nº 17  ·  18 May 2026

Editor's Note

Monday opens with a quiet but consequential question: can the open-weight models running in your own basement actually orchestrate real biology yet?

 

Nº 01 · The Lede  —  bioRxiv  —  Field report

Open LLMs tested as lab orchestrators

Fig. I  bioRxiv · Filed 18 May 2026.

A new bioRxiv preprint from the Galaxy team evaluates open-weight LLMs as agentic orchestrators for routine biomedical analysis, benchmarking models a typical lab can actually self-host against closed frontier models on multi-step pipeline planning. The work scores how reliably each model chains tool calls, recovers from errors, and produces analyses a bench scientist would accept. Results map which open checkpoints clear the bar today and where they still drop calls. Sets the first concrete reference point for self-hosted agentic analysis in biology: the debate over whether labs need frontier API access for orchestration now has numbers attached.

Read the source →

Why it matters

Self-hosting agentic biology shifts from aspiration to a measurable gap; vendors pitching closed-model dependence for lab orchestration now have a published yardstick working against them.

 

Nº 02  —  bioRxiv  —  Field report

Physics scoring rescues AI binders

Fig. II  bioRxiv · Filed 18 May 2026.

Statistical physics scoring filters out hallucinated protein binders from generative AI pipelines, using a zero-shot ensemble approach that needs no task-specific training. The method flags designs that look plausible to the generator but fail thermodynamic sanity checks. Moves AI-designed binder workflows closer to deployment by attacking the false-positive rate that has dogged every public benchmark so far.

Read more →

 

Nº 03  —  arXiv  —  Field report

LLMs describe monkey visual neurons

Fig. III  arXiv · Filed 18 May 2026.

Language models characterize what individual monkey visual neurons respond to, generating natural-language descriptions of tuning properties directly from neural recordings. The pipeline turns hours of electrophysiology interpretation into automated captions a neuroscientist can read. Narrows the gap between raw recording data and shareable functional annotation — a workflow that has resisted automation for decades.

Read more →

 

Also Filed  ·  One Brief from the queue

Nº 04  —  arXiv  —  Benchmarks · Evaluation

Multi-hop disease reasoning benchmark released

MedHopQA tests multi-hop biomedical reasoning on disease-centered questions that require chaining facts across sources, going beyond single-lookup QA benchmarks. Anchors a tougher reference point for LLM-based clinical question answering, where most published scores still come from the kind of one-hop retrieval that flatters the models — a pattern BiomniBench exposed last week.

Read →

 

· · ·

Reply with your discoveries. A human reads them. Forward freely.