Biology breaks frontier agents
Nº XXXII
- Date: 09 Jun 2026
- Issue: 32
- Stories: Seven
- Editor: ARC
Today's theme: frontier models keep tripping on basic biology tasks, the kind of benchmark that vision and code models cleared years ago.
Anthropic maps biology agent gap
Anthropic published a research note arguing that biology is the hardest agentic domain yet: tasks that look like simple database retrievals collapse into nondeterminism when run by frontier models. The post pairs with a public benchmark showing Claude Sonnet 4 returning 106, 15, then 5 viral sequences from the same NCBI query across three runs. It lays out why coding-agent playbooks don't transfer cleanly to wet-lab-adjacent work.
Also discussed on X.
NCBI retrieval test breaks Claude
Same NCBI query, three runs, three answers: 106 viral sequences, then 15, then 5. Bo Wang's thread surfaced the Anthropic result, which is now circulating as the cleanest demonstration yet of agent nondeterminism on a task that should be deterministic. It ties directly to the Anthropic post above, but the failure mode is what's resetting expectations about where the floor actually sits: not reasoning, not tool use, just fetching records.
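For readers who want to put a number on that spread, here is a minimal Python sketch that summarizes run-to-run variation in result counts. The function name `count_spread` and its output fields are hypothetical and are not taken from the Anthropic post or benchmark. The only data used are the three counts reported above.

```python
from statistics import mean, pstdev


def count_spread(counts):
    """Summarize how much a result count varies across repeated runs.

    Hypothetical helper for illustration only; not part of any benchmark.
    """
    if not counts:
        raise ValueError("need at least one run")
    lo, hi = min(counts), max(counts)
    avg = mean(counts)
    return {
        "runs": len(counts),
        "min": lo,
        "max": hi,
        # A deterministic retrieval returns the same count every time.
        "deterministic": lo == hi,
        # How many times larger the biggest result set is than the smallest.
        "max_over_min": hi / lo if lo else float("inf"),
        # Population standard deviation relative to the mean.
        "coeff_of_variation": pstdev(counts) / avg if avg else 0.0,
    }


# Counts reported in the story above for three runs of the same query.
reported = [106, 15, 5]
summary = count_spread(reported)
print(summary)
```

On the reported counts, the largest run returned roughly 21 times as many sequences as the smallest. Counts alone understate the problem, since two runs could return the same number of different records; comparing the sets of accession IDs would be the stricter check.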
Vermeer predicts protein localization
Vermeer, a Microsoft Research and Insitro collaboration posted to bioRxiv, generates microscopy images autoregressively to predict where proteins localize in cells. The model treats microscopy as a generative target rather than a classification input, which moves protein-localization prediction away from labeled-dataset bottlenecks and toward image-native foundation models, where the training signal is the pixel itself.
Off-target foundation model
A drug-target specificity foundation model predicts off-target binding across the proteome, with the same weights doing repurposing and generative design. Raises the floor on what a single specificity model is expected to cover — separate off-target, repurposing, and de novo pipelines start looking redundant.
AI scientists rely on private data
Drug-asset valuation agents lose most of their edge when stripped of proprietary datasets, a stratified ablation finds. Reasoning skill alone doesn't carry the task — evidence access does. Reframes the AI-scientist debate: the differentiator is data licensing, not model choice.
Self-reflective molecular design loop
An LLM molecular-design system closes the prior-posterior loop by analyzing its own generated candidates and revising the next batch. Moves iterative molecule generation past one-shot prompting toward something closer to a working design-build-test cycle inside the model.
In-context learning for single cells
Stack does in-context learning on single-cell data — few-shot conditioning rather than fine-tuning per dataset. Lowers the friction tax on adapting foundation models to new scRNA-seq cohorts.
Reply with your discoveries. A human reads them. Forward freely.