Benchmarks break; biology agents advance
Good Morning!
Today's set spans a retired coding benchmark, a billion-dollar self-learning bet, and two clinical AI systems that actually touch patient data.
1. Agent matches experts on myeloma records

Agentic clinical reasoning over longitudinal multiple myeloma records reached expert-consensus accuracy in a retrospective evaluation, according to a new arXiv preprint. The agent — built on a large language model with access to structured EHR (electronic health record) data — was scored against oncologist panel decisions across staging, treatment sequencing, and response assessment. It matched expert consensus at rates competitive with individual clinician agreement, with performance tracked across multi-visit timelines rather than single-snapshot queries. Read More →
Why it matters: This is one of the first published evaluations of an LLM-based agent on longitudinal, multi-visit oncology records benchmarked against a structured expert panel — giving clinical AI teams a concrete bar to cite when scoping myeloma decision-support applications.
2. FastOMOP targets real-world evidence agents

FastOMOP structures agentic queries against data in the OMOP CDM (Observational Medical Outcomes Partnership Common Data Model, a standardized schema widely used for observational health data), giving agents a reliable architecture for generating real-world evidence without ad hoc SQL generation. The preprint describes validation layers that catch schema mismatches, a common failure mode in unguarded LLM-to-database pipelines, before agents commit to downstream analysis. Read More →
3. David Silver bets $1.1B on data-free AI

David Silver, AlphaGo's lead architect, raised $1.1 billion for a new company building AI systems that learn entirely through self-play and environment interaction, with no human-labeled training data. The approach echoes the reinforcement learning (RL) methods behind AlphaZero, but applied to a broader set of domains. For biomedical researchers, the bet is notable: methods that need no human labels are most relevant to tasks such as protein folding and drug-target interaction prediction, where labeled data is sparse. Read More →
4. Protein LMs flag AMR variant risk

Protein language models predict which novel AMR (antimicrobial resistance) variants are highest-risk before they spread, using sequence-level embeddings to score fitness and resistance likelihood without requiring wet-lab screening of every candidate. Read More →
5. Generative model designs DNA-binding proteins

A generative model designed sequence-specific DNA-binding proteins de novo that achieve programmable targeting, according to a new bioRxiv preprint, extending protein design beyond enzymes and binders into transcription-factor-like architectures. Read More →
6. OpenAI retires SWE-Bench over gaming

OpenAI has retired SWE-Bench Verified after evidence that agent developers were overfitting to its test cases, a practice known as benchmaxxing, in which scores rise without real capability gains. OpenAI now says it will no longer use the benchmark internally. Read More →
Reply to talk back — this email comes to a human (newsletter@heurekalabs.co). Forward freely.