3 min read

Benchmarks break; biology agents advance

Good Morning!

Today's set spans a retired coding benchmark, a billion-dollar self-learning bet, and two clinical AI systems that actually touch patient data.


1. Agent matches experts on myeloma records

Agentic clinical reasoning over longitudinal multiple myeloma records reached expert-consensus accuracy in a retrospective evaluation, according to a new arXiv preprint. The agent — built on a large language model with access to structured EHR (electronic health record) data — was scored against oncologist panel decisions across staging, treatment sequencing, and response assessment. It matched expert consensus at rates competitive with individual clinician agreement, with performance tracked across multi-visit timelines rather than single-snapshot queries. Read More →

Why it matters: This is one of the first published evaluations of an LLM-based agent on longitudinal, multi-visit oncology records benchmarked against a structured expert panel — giving clinical AI teams a concrete bar to cite when scoping myeloma decision-support applications.


2. FastOMOP targets real-world evidence agents

FastOMOP structures agentic queries against data stored in the OMOP CDM (Observational Medical Outcomes Partnership Common Data Model), the standardized schema used by most large health system databases. The goal is a reliable architecture for generating real-world evidence without ad hoc SQL generation. The preprint describes validation layers that catch schema mismatches before agents commit to downstream analysis, a common failure mode in unguarded LLM-to-database pipelines. Read More →
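
To make the schema-validation idea concrete, here is a minimal sketch of a pre-execution check. It is not FastOMOP's actual API: the function name `validate_query_plan`, the plan format (table name mapped to requested columns), and the three-table schema subset are all assumptions made for illustration. Only the OMOP CDM table and column names themselves are real.

```python
# Hypothetical sketch, not FastOMOP's API: reject an agent's query plan
# if it references tables or columns missing from the OMOP CDM schema.

# Small illustrative subset of OMOP CDM tables and columns.
OMOP_SCHEMA = {
    "person": {"person_id", "gender_concept_id", "year_of_birth"},
    "condition_occurrence": {
        "condition_occurrence_id",
        "person_id",
        "condition_concept_id",
        "condition_start_date",
    },
    "drug_exposure": {
        "drug_exposure_id",
        "person_id",
        "drug_concept_id",
        "drug_exposure_start_date",
    },
}


def validate_query_plan(plan):
    """Return a list of schema mismatches (empty if the plan is valid).

    `plan` maps each table name to the column names the agent wants to use.
    This plan format is an assumption made for this example.
    """
    errors = []
    for table, columns in plan.items():
        if table not in OMOP_SCHEMA:
            errors.append(f"unknown table: {table}")
            continue  # cannot check columns of a table that does not exist
        for column in columns:
            if column not in OMOP_SCHEMA[table]:
                errors.append(f"unknown column: {table}.{column}")
    return errors


if __name__ == "__main__":
    good_plan = {"person": ["person_id", "year_of_birth"]}
    bad_plan = {"person": ["birth_year"], "diagnosis": ["code"]}
    print(validate_query_plan(good_plan))
    print(validate_query_plan(bad_plan))
```

Checking a structured plan instead of raw SQL keeps the sketch short. A real pipeline would more likely parse the generated SQL and validate it against the full CDM specification.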


3. David Silver bets $1.1B on data-free AI

David Silver, AlphaGo's lead architect, raised $1.1 billion for a new company building AI systems that learn entirely through self-play and environment interaction, with no human-labeled training data. The approach echoes the reinforcement learning (RL) methods behind AlphaGo Zero and AlphaZero, applied to a broader set of domains. For biomedical researchers, the bet is notable: methods that learn without labels would matter most in tasks where labeled data is sparse, such as drug-target interaction prediction. Read More →


4. Protein LMs flag AMR variant risk

Protein language models predict which novel AMR (antimicrobial resistance) variants are highest-risk before they spread, using sequence-level embeddings to score fitness and resistance likelihood without requiring wet-lab screening of every candidate. Read More →


5. Generative model designs DNA-binding proteins

A generative model designed sequence-specific DNA-binding proteins de novo that achieve programmable targeting, according to a new bioRxiv preprint. The work extends protein design beyond enzymes and binders into transcription-factor-like architectures. Read More →


6. OpenAI retires SWE-Bench over gaming

OpenAI retired SWE-Bench Verified after evidence that agent developers were overfitting to its test cases, a practice called benchmaxxing, in which scores rise without real capability gains. OpenAI now says it will no longer use the benchmark internally. Read More →


Reply to talk back — this email comes to a human (newsletter@heurekalabs.co). Forward freely.