29 Apr 2026 3 min read

Agents break in production — here's the data

Good morning!

Today's issue covers two fast-moving fronts: agent reliability, where the news is failure data, and biomolecular AI, where it is new models and validated designs.

1. LLMs corrupt documents at scale

Frontier LLMs corrupt roughly 25% of document content during long edit workflows, according to a new benchmark from Microsoft. The evaluation, called LOFT (Long-context Output Faithfulness Test), stress-tests models on multi-step document editing and shows that drift and hallucinated insertions compound over successive revisions. For biomedical teams building agents that annotate or revise clinical notes, protocols, or regulatory submissions, a corruption rate that high limits how far agent output can be trusted unless a verification layer checks each revision. Read More →

Why it matters: Any biomedical agent pipeline touching regulatory documents or clinical text now has a concrete failure rate to benchmark against. Ask your LLM vendor for their LOFT score before deploying on anything submission-critical.
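
One way to picture the verification layer mentioned above is a diff check that runs after every agent revision. The sketch below is an illustration only, not part of the benchmark: the function name `unexpected_changes` and the idea of an `allowed_lines` set (the 0-indexed lines the agent was asked to edit) are assumptions made for this example. It uses Python's standard `difflib` to flag any change that falls outside the requested lines.

```python
from __future__ import annotations

import difflib


def unexpected_changes(
    original: str, edited: str, allowed_lines: set[int]
) -> list[tuple[str, int, int]]:
    """Return diff operations that touch original lines outside allowed_lines.

    Hypothetical helper: allowed_lines holds the 0-indexed lines of the
    original document that the agent was explicitly asked to edit.
    """
    a = original.splitlines()
    b = edited.splitlines()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    flagged = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        # Pure insertions have i1 == i2; attribute them to the line they precede.
        touched = set(range(i1, i2)) or {i1}
        if not touched <= allowed_lines:
            flagged.append((tag, i1, i2))
    return flagged


note = "Dose: 5 mg\nRoute: oral\nNotes: none"
good_edit = "Dose: 5 mg\nRoute: oral\nNotes: patient fasting"
bad_edit = "Dose: 50 mg\nRoute: oral\nNotes: patient fasting"

print(unexpected_changes(note, good_edit, allowed_lines={2}))
print(unexpected_changes(note, bad_edit, allowed_lines={2}))
```

A line-level diff like this only catches edits outside the requested span. It says nothing about whether the requested edit itself is correct, so it complements human review rather than replacing it.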

2. Multimodal biomolecule foundation model drops

MIMIC, posted to arXiv, unifies sequence, structure, and functional data in a single generative foundation model for biomolecules (a large model pre-trained across multiple molecular modalities). The architecture handles proteins, small molecules, and nucleic acids in one representational space, enabling cross-modal generation, such as predicting structure from sequence or function from partial structure, without separate task-specific models. Read More →

3. AI-designed photoactivatable PARP1 inhibitors validated

Computationally designed PARP1 inhibitors with photoactivatable (light-switchable) caging groups were experimentally validated in a new arXiv preprint, closing the loop from in silico design to bench confirmation. The workflow couples structure-based design with photochemistry constraints, demonstrating that AI-driven molecular design pipelines can now incorporate non-standard chemical handles beyond classical pharmacophores. Read More →

4. Testing agents in production stays hard

An r/MachineLearning thread on testing AI agents (autonomous LLM systems that call tools and take multi-step actions) in live production environments drew practitioners describing brittle eval suites, cascading failures from tool-call errors, and the difficulty of distinguishing model drift from environmental change. The recurring recommendation: shadow-mode deployment (running the agent in parallel with human workflows before going live) plus structured logging of every tool invocation. Read More →
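
The two recommendations from the thread can be sketched in a few lines. This is a minimal illustration, not code from the thread: `logged_tool`, `shadow_compare`, and the in-memory `sink` list are hypothetical names chosen for the example, and a real deployment would write to a log pipeline instead of a list. The decorator emits one JSON record per tool call, and the shadow helper records the agent's proposed action next to the human's without executing it.

```python
import functools
import json
import time
from typing import Any, Callable


def logged_tool(name: str, sink: list) -> Callable:
    """Wrap a tool so every invocation appends one JSON record to sink."""

    def decorator(fn: Callable[..., Any]) -> Callable[..., Any]:
        @functools.wraps(fn)
        def wrapper(**kwargs: Any) -> Any:
            record = {"tool": name, "args": kwargs, "ts": time.time()}
            start = time.perf_counter()
            try:
                result = fn(**kwargs)
                record.update(status="ok", result=result)
                return result
            except Exception as exc:
                # Log the failure, then re-raise so the caller still sees it.
                record.update(status="error", error=repr(exc))
                raise
            finally:
                record["duration_s"] = time.perf_counter() - start
                sink.append(json.dumps(record, default=str))

        return wrapper

    return decorator


def shadow_compare(agent_action: str, human_action: str, sink: list) -> bool:
    """Log the agent's proposal beside the human's action; nothing is executed."""
    agree = agent_action == human_action
    sink.append(json.dumps({
        "shadow": True,
        "agent": agent_action,
        "human": human_action,
        "agree": agree,
    }))
    return agree


log: list = []


@logged_tool("lookup_dose", log)
def lookup_dose(drug: str) -> int:
    # Toy tool for the example; raises KeyError for unknown drugs.
    return {"aspirin": 100}[drug]


lookup_dose(drug="aspirin")
shadow_compare("approve", "reject", log)
for line in log:
    print(line)
```

Keeping one record per call, including failures, is what makes it possible to trace a cascading failure back to the first bad tool call, and the agreement rate from shadow mode gives a baseline before the agent is allowed to act.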

5. Perturbation response transfer across cell contexts

HyperMap, a lightweight transfer framework posted to bioRxiv, transfers perturbation response predictions across diverse biological contexts (cell lines, tissues, species). It is relevant for teams building agents that generalize CRISPR or drug-response models beyond their training distribution. Read More →

6. PROTEUS maps protein conformational states

PROTEUS, an ensemble-aware framework posted to bioRxiv, models conformational plasticity in proteins, giving computational teams a structured tool for characterizing multiple functional states. That makes it relevant for allosteric drug design and for agent-driven docking pipelines that must account for receptor flexibility. Read More →

Reply with your discoveries. A human reads them. Forward freely.