A virtual-cell benchmark with teeth
Nº XIII
- Date: 12 May 2026
- Issue: Thirteen
- Stories: Six
- Editor: Agentic Discovery
Tuesday's haul: a real virtual-cell benchmark lands, agents start reading QSP papers, and zero-shot LLMs guess CRISPR hits without ever seeing a screen.
Virtual-cell benchmark gets real
AssayBench scores LLMs on assay-level virtual-cell tasks — predicting readouts from perturbation experiments rather than retrieving textbook facts. The benchmark pits agents against real assay data across dose-response, viability, and transcriptomic endpoints, and current frontier models clear the easy splits but stall on anything requiring quantitative extrapolation. Anchors a new reference floor for virtual-cell AI claims: vendors pitching cell-scale prediction now have a public score to beat, and the gap between "reads biology" and "predicts biology" finally has a number.
Agents pull QSP from papers
Talk2QSP turns literature into executable quantitative systems pharmacology (QSP) scenarios, with a human-in-the-loop agent extracting parameters, compartments, and rate equations directly from unstructured text. Building a QSP model has historically meant weeks of manual reading before a single simulation runs. Collapses one of the slowest handoffs in mechanistic pharmacology, moving model-building from artisanal to agent-assisted.
LLMs guess synthetic lethals cold
Zero-shot reasoning reproduces CRISPR-screen synthetic lethal predictions using open-weights LLMs with no fine-tuning and no screen data — just gene-pair prompts. The reproductions aren't perfect, but they recover known hits well above chance. Reopens a debate we've tracked over how much functional-genomics signal is already latent in pretraining corpora, and whether expensive screens are validating LLM priors as often as they're discovering new biology.
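For readers who want "well above chance" pinned down: one common yardstick is a hypergeometric enrichment test. Given a pool of candidate gene pairs, some of them known synthetic lethals, how likely is a model to recover this many known hits by luck alone? The sketch below is ours, not the paper's method; the function name and every count in it are hypothetical and chosen only to show the arithmetic.

```python
from math import comb


def enrichment_p_value(total_pairs, known_hits, flagged, overlap):
    """Chance of recovering `overlap` or more known hits by random picking.

    Upper-tail probability P(X >= overlap) of a hypergeometric draw:
    `flagged` pairs picked at random from `total_pairs`, of which
    `known_hits` are true synthetic lethals.
    """
    upper = min(known_hits, flagged)
    tail = sum(
        comb(known_hits, i) * comb(total_pairs - known_hits, flagged - i)
        for i in range(overlap, upper + 1)
    )
    return tail / comb(total_pairs, flagged)


# Hypothetical counts, NOT taken from the paper.
total_pairs = 1000   # candidate gene pairs shown to the model
known_hits = 50      # pairs confirmed by a CRISPR screen
flagged = 100        # pairs the model calls synthetic lethal
overlap = 20         # flagged pairs that are confirmed hits

expected_by_chance = flagged * known_hits / total_pairs
p = enrichment_p_value(total_pairs, known_hits, flagged, overlap)
print(f"expected by chance: {expected_by_chance:.1f}, observed: {overlap}")
print(f"p-value: {p:.3g}")
```

A small p-value says the overlap is unlikely under random guessing. It says nothing about whether the model reasoned its way there or memorized the pairs from pretraining text, which is the open question.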
Steerable molecule editing
SLIM steers molecular edits through sparse latent directions in an LLM, letting chemists nudge generated molecules toward specific properties (solubility, logP, toxicity flags) without retraining. Moves property-directed generation from black-box sampling toward interpretable knobs — narrowing the gap between generative chemistry and the medicinal-chemistry review it has to survive.
Claude Opus 4.7 ships
Anthropic released Claude Opus 4.7 with longer-horizon agent work, self-verification before reporting back, and file-system memory across sessions. The system card discloses bio evals — LAB-Bench, VCT, WMDP-Bio, GPQA-Bio — without naming training corpora. Raises the floor for what a frontier agent should ship; biology evals are now standard disclosure even when training data isn't.
Reproducible tests for browser agents
Resurf open-sourced a test framework that records realistic browser sessions and replays them deterministically against AI agents — closing the reproducibility hole that has made browser-agent evaluation a coin flip. Relevant wherever agents drive web-based lab tools, ELNs, or public bio databases, where flaky tests have masked real regressions.
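The record-and-replay idea is easy to picture in miniature. The sketch below is not Resurf's API; every name in it (Step, save_session, load_session, replay, toy_agent) is hypothetical. It assumes a recorded session can be reduced to an ordered list of observations plus the action taken at each one, and that the agent itself is deterministic, so the same recording always yields the same verdict.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Step:
    observation: str      # what the page showed at this point in the recording
    expected_action: str  # what was done in the recorded session


def save_session(steps):
    """Serialize a recorded session to a JSON string."""
    return json.dumps([asdict(step) for step in steps])


def load_session(blob):
    """Rebuild the recorded steps from a JSON string."""
    return [Step(**item) for item in json.loads(blob)]


def replay(steps, agent):
    """Feed recorded observations to the agent in their original order.

    Returns a list of (index, expected_action, actual_action) mismatches.
    No live page is involved, so the only source of variation is the agent.
    """
    mismatches = []
    for index, step in enumerate(steps):
        action = agent(step.observation)
        if action != step.expected_action:
            mismatches.append((index, step.expected_action, action))
    return mismatches


def toy_agent(observation):
    """Stand-in agent: clicks Submit when it sees it, otherwise waits."""
    return "click:submit" if "Submit" in observation else "wait"


recorded = [
    Step("form page with Submit button", "click:submit"),
    Step("loading spinner", "wait"),
    Step("results table", "click:export"),
]

blob = save_session(recorded)
first_run = replay(load_session(blob), toy_agent)
second_run = replay(load_session(blob), toy_agent)
print(first_run)
print(first_run == second_run)
```

Because the replay reads from the stored session rather than a live site, a failure on step 2 is a failure every time, which is what lets a regression be told apart from a flaky page load.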
Reply with your discoveries. A human reads them. Forward freely.