A virtual-cell benchmark with teeth


Agentic Discovery  ·  Nº Thirteen  ·  12 May 2026

Editor's Note

Tuesday's haul: a real virtual-cell benchmark lands, agents start reading QSP papers, and zero-shot LLMs guess CRISPR hits without ever seeing a screen.

 

Nº 01 · The Lede  —  arXiv  —  Cell biology · Funding

Virtual-cell benchmark gets real

Fig. I  arXiv, 12 May 2026.

AssayBench scores LLMs on assay-level virtual-cell tasks — predicting readouts from perturbation experiments rather than retrieving textbook facts. The benchmark pits agents against real assay data across dose-response, viability, and transcriptomic endpoints, and current frontier models clear the easy splits but stall on anything requiring quantitative extrapolation. Anchors a new reference floor for virtual-cell AI claims: vendors pitching cell-scale prediction now have a public score to beat, and the gap between "reads biology" and "predicts biology" finally has a number.

Read the source →

Why it matters

Virtual-cell AI has been pitched on vibes and cherry-picked demos for two years; AssayBench drops a falsifiable target into that conversation and resets what counts as evidence in a space CZ Biohub is funding at the half-billion-dollar level.

 

Nº 02  —  bioRxiv  —  Agents · Infrastructure

Agents pull QSP from papers

Fig. II  bioRxiv, 12 May 2026.

Talk2QSP turns literature into executable quantitative systems pharmacology scenarios, with a human-in-the-loop agent extracting parameters, compartments, and rate equations directly from unstructured text. QSP modeling has historically been a weeks-long manual reading job before a single simulation runs. Collapses one of the slowest handoffs in mechanistic pharmacology, moving model-building from artisanal to agent-assisted.
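To make "executable scenario" concrete, here is a minimal sketch of what running extracted parameters, compartments, and rate equations might look like. It is not Talk2QSP's actual output format or API: the `scenario` layout, the `simulate` helper, and every number in it are assumptions chosen for illustration, using a single compartment with first-order elimination.

```python
import math

# Hypothetical structured result of a literature-extraction step.
# All values are placeholders, not taken from any paper.
scenario = {
    "compartments": {"central": 100.0},  # initial drug amount (mg)
    "parameters": {"k_el": 0.1},         # first-order elimination rate (1/h)
    "rates": {
        # dA/dt for each compartment, as a function of state and parameters
        "central": lambda state, p: -p["k_el"] * state["central"],
    },
}

def simulate(scenario, t_end, dt=0.001):
    """Integrate the scenario's rate equations with forward Euler."""
    state = dict(scenario["compartments"])
    params = scenario["parameters"]
    for _ in range(int(round(t_end / dt))):
        # Evaluate all derivatives first, then update, so every
        # compartment sees the same time point.
        derivs = {name: rate(state, params)
                  for name, rate in scenario["rates"].items()}
        for name, d in derivs.items():
            state[name] += d * dt
    return state

final = simulate(scenario, t_end=10.0)
analytic = 100.0 * math.exp(-0.1 * 10.0)  # closed form for this toy case
print(final["central"], analytic)
```

The closed-form check is the part worth keeping in any real pipeline: a human in the loop can only review extracted equations quickly if each scenario comes with something to compare the simulation against.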

Read more →

 

Nº 03  —  bioRxiv  —  Field report

LLMs guess synthetic lethals cold

Fig. III  bioRxiv, 12 May 2026.

Zero-shot reasoning reproduces CRISPR-screen synthetic-lethal predictions using open-weights LLMs with no fine-tuning and no screen data, just gene-pair prompts. The reproductions aren't perfect, but they recover known hits well above chance. Reopens a debate we've tracked over how much functional-genomics signal is already latent in pretraining corpora, and whether expensive screens are validating LLM priors as often as they're discovering new biology.
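"Well above chance" has a simple operational reading: the hit rate among the model's top-ranked gene pairs, divided by the hit rate across all pairs scored. The sketch below shows that calculation only. It is not the preprint's evaluation code, the function name is an assumption, and the gene names and scores are placeholders rather than real data.

```python
def enrichment_over_chance(scores, known_hits, top_k):
    """Compare precision in the top-k ranked pairs with the base rate.

    scores: dict mapping a gene pair to a model-assigned score
    known_hits: set of pairs treated as validated synthetic lethals
    Assumes at least one known hit is present among the scored pairs.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    top = ranked[:top_k]
    precision = sum(pair in known_hits for pair in top) / top_k
    base_rate = sum(pair in known_hits for pair in ranked) / len(ranked)
    return precision, base_rate, precision / base_rate

# Placeholder gene pairs and scores, for illustration only.
scores = {
    ("GENE_A", "GENE_B"): 0.91,
    ("GENE_C", "GENE_D"): 0.84,
    ("GENE_A", "GENE_C"): 0.40,
    ("GENE_B", "GENE_D"): 0.35,
    ("GENE_E", "GENE_F"): 0.22,
    ("GENE_A", "GENE_F"): 0.15,
    ("GENE_C", "GENE_E"): 0.10,
    ("GENE_B", "GENE_E"): 0.05,
}
known_hits = {("GENE_A", "GENE_B"), ("GENE_C", "GENE_D")}

precision, base_rate, enrichment = enrichment_over_chance(
    scores, known_hits, top_k=2
)
print(precision, base_rate, enrichment)
```

An enrichment of 1.0 means the ranking is no better than picking pairs at random; how far above 1.0 a model lands, and at which `top_k`, is the number to ask for when a claim like this one is made.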

Read more →

 

Also Filed  ·  Three Briefs from the queue

Nº 04  —  arXiv  —  Field report

Steerable molecule editing

SLIM steers molecular edits through sparse latent directions in an LLM, letting chemists nudge generated molecules toward specific properties (solubility, logP, toxicity flags) without retraining. Moves property-directed generation from black-box sampling toward interpretable knobs — narrowing the gap between generative chemistry and the medicinal-chemistry review it has to survive.

Read →

Nº 05  —  Anthropic  —  Field report

Claude Opus 4.7 ships

Anthropic released Claude Opus 4.7 with longer-horizon agent work, self-verification before reporting back, and file-system memory across sessions. The system card discloses bio evals — LAB-Bench, VCT, WMDP-Bio, GPQA-Bio — without naming training corpora. Raises the floor for what a frontier agent should ship; biology evals are now standard disclosure even when training data isn't.

Read →

Nº 06  —  Hacker News  —  Agents · Infrastructure

Reproducible tests for browser agents

Resurf open-sourced a test framework that records realistic browser sessions and replays them deterministically against AI agents — closing the reproducibility hole that has made browser-agent evaluation a coin flip. Relevant wherever agents drive web-based lab tools, ELNs, or public bio databases, where flaky tests have masked real regressions.
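The record-and-replay idea can be sketched in a few lines. This is not Resurf's API: `ReplaySession`, `toy_agent`, and the `example.test` URLs are hypothetical names invented here to show why replaying a fixed recording makes an agent run repeatable.

```python
class ReplaySession:
    """Serve previously recorded responses instead of hitting the network."""

    def __init__(self, recording):
        self._recording = dict(recording)

    def fetch(self, url):
        # Fail loudly on anything that was not recorded, so a test
        # never silently falls through to live, changing content.
        if url not in self._recording:
            raise KeyError(f"unrecorded request: {url}")
        return self._recording[url]

def toy_agent(session, start_url):
    """Stand-in agent: follow 'next' links until a page has none."""
    trace = []
    url = start_url
    while url is not None:
        page = session.fetch(url)
        trace.append((url, page["title"]))
        url = page.get("next")
    return trace

# Hypothetical recording of a three-page session.
recording = {
    "https://example.test/login": {"title": "Login", "next": "https://example.test/search"},
    "https://example.test/search": {"title": "Search", "next": "https://example.test/result"},
    "https://example.test/result": {"title": "Result"},
}

first_run = toy_agent(ReplaySession(recording), "https://example.test/login")
second_run = toy_agent(ReplaySession(recording), "https://example.test/login")
print(first_run == second_run)
```

With the page content pinned, any difference between two runs comes from the agent rather than the site, which is what lets a failing test point at a real regression.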

Read →

 

· · ·

Reply with your discoveries. A human reads them. Forward freely.