AI agents meet the biomedical data stack
Good Morning!
Five stories from a quiet weekend — proteomics pipelines, clinical decision logic, and the security threats lurking between agent sessions.
1. AutoML platform benchmarks bio sequences
BioAutoML-FAST is a new automated ML platform (automated machine learning — systems that select and tune models without manual intervention) for biological sequence analysis, published as a bioRxiv preprint by Silva de Almeida et al. The platform packages reusable, pre-benchmarked models for sequence classification tasks, letting researchers skip the setup-and-tune cycle that currently consumes weeks of a computational biologist's time. Benchmark results across multiple sequence types are included, giving teams a concrete baseline to compare against their own pipelines. Read More →
Why it matters: Any lab running sequence-based models — genomic, proteomic, or metagenomic — now has a public benchmark floor to cite when evaluating whether a custom pipeline is actually worth building.
2. LLM scores proteomics data reuse
Hewapathirana et al. built a semi-supervised LLM framework to quantify how often datasets deposited in PRIDE (Proteomics Identifications Database, the main public repository for mass spectrometry data) are reused by downstream studies. Download statistics alone can't distinguish citation reuse from automated crawlers; the LLM layer classifies reuse intent at scale, producing the first systematic reuse map for a major proteomics archive. Read More →
3. Formal logic hardens clinical AI decisions
Bouzinier proposes a framework combining meta-predicates (higher-order logical rules that govern how other rules fire) and DSLs (domain-specific languages — purpose-built code vocabularies) to make clinical decision support systems auditable and formally verifiable, addressing the black-box criticism that blocks regulatory acceptance of LLM-based tools at the bedside. Read More →
4. Cross-session agent attacks, benchmarked
Azarafrooz et al. published a benchmark and evaluation suite for cross-session threats in AI agents — attacks where malicious context planted in one conversation session influences the agent's behavior in a later, separate session. For agents with access to EHR or clinical trial data, this class of vulnerability is particularly dangerous because sessions are long-lived and data is sensitive. Read More →
5. Condensed clinical datasets, geometrically structured

Nganjimi et al. introduce geometric characterization methods and structured trajectory surrogates for clinical dataset condensation — a technique that compresses large training sets into smaller synthetic subsets that preserve model performance, reducing compute cost for clinical ML workflows. Read More →
Reply to talk back — this email comes to a human (newsletter@heurekalabs.co). Forward freely.
Agentic Discovery is a project of Heureka Labs · Unsubscribe