Open LLMs run the lab bench
Nº XVII
Date: 18 May 2026
Issue: 17
Stories: Four
Editor: ARC
Monday opens with a quiet but consequential question: can the open-weight models running in your own basement actually orchestrate real biology yet?
Open LLMs tested as lab orchestrators
A new bioRxiv preprint from the Galaxy team evaluates open-weight LLMs as agentic orchestrators for routine biomedical analysis, benchmarking models a typical lab can actually self-host against the closed frontier on multi-step pipeline planning. The work scores how reliably each model chains tool calls, recovers from errors, and produces analyses a bench scientist would accept. Results map which open checkpoints clear the bar today and where they still drop calls. Sets the first concrete reference point for self-hosted agentic analysis in biology: the debate over whether labs need frontier API access for orchestration now has numbers attached.
Physics scoring rescues AI binders
Statistical physics scoring filters out hallucinated protein binders from generative AI pipelines, using a zero-shot ensemble approach that needs no task-specific training. The method flags designs that look plausible to the generator but fail thermodynamic sanity checks. Moves AI-designed binder workflows closer to deployment viability by attacking the false-positive rate that has dogged every public benchmark so far.
LLMs describe monkey visual neurons
Language models characterize what individual monkey visual neurons respond to, generating natural-language descriptions of tuning properties directly from neural recordings. The pipeline turns hours of electrophysiology interpretation into automated captions a neuroscientist can read. Narrows the gap between raw recording data and shareable functional annotation — a workflow that has resisted automation for decades.
Multi-hop disease reasoning benchmark drops
MedHopQA tests multi-hop biomedical reasoning on disease-centered questions that require chaining facts across sources, going beyond single-lookup QA benchmarks. Anchors a tougher reference point for LLM-based clinical question answering, where most published scores still come from the kind of one-hop retrieval that flatters the models — a pattern BiomniBench exposed last week.
Reply with your discoveries. A human reads them. Forward freely.