Abstract
arXiv:2605.11374v3

Test-time compute is widely believed to benefit only large reasoning models. We show that it also helps small embedding models. Because modern embedding models are distilled from LLM backbones, we hypothesize that a frozen encoder can benefit from extra inference compute without retraining. An agentic program-search loop explores 144 candidate programs over a frozen encoder API and produces twelve Pareto-optimal programs with cost ratios from $c=1.2$ to $c=14.7$ relative to the single-pass baseline. The search independently rediscovers Rocchio pseudo-relevance feedback, ColBERT-style MaxSim at sentence granularity, reciprocal rank fusion, and the Fisher linear discriminant, all without trainable parameters or external models. Every frontier program improves nDCG@10 over the frozen baseline on all 14 MMTEB retrieval tasks, which span legal, financial, long-document, and general domains. The programs transfer without modification to unseen encoder families and to nineteen held-out retrieval tasks: for 68% of model-task pairs, at least one frontier program improves over the cosine baseline.
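To make two of the rediscovered techniques concrete, the following is a minimal sketch of Rocchio pseudo-relevance feedback followed by reciprocal rank fusion over precomputed embeddings. It is not one of the paper's frontier programs: the function names, the toy two-dimensional vectors, and the parameter values (`top_k`, `alpha`, `beta`, and the fusion constant `k=60`) are all assumptions chosen for illustration, and the embeddings stand in for the output of a frozen encoder.

```python
from collections import defaultdict
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank(query, docs):
    """Return document indices sorted by descending cosine similarity."""
    return sorted(range(len(docs)), key=lambda i: cosine(query, docs[i]), reverse=True)


def rocchio(query, docs, top_k=2, alpha=1.0, beta=0.5):
    """Move the query toward the centroid of its top_k retrieved documents."""
    top = rank(query, docs)[:top_k]
    centroid = [sum(docs[i][d] for i in top) / len(top) for d in range(len(query))]
    return [alpha * q + beta * c for q, c in zip(query, centroid)]


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for position, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + position)
    return sorted(scores, key=scores.get, reverse=True)


# Toy embeddings (assumed); in practice these come from the frozen encoder.
docs = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
query = [1.0, 0.2]

first_pass = rank(query, docs)          # single-pass cosine baseline
expanded = rocchio(query, docs)         # query refined without any training
second_pass = rank(expanded, docs)      # second scoring pass over the same embeddings
fused = reciprocal_rank_fusion([first_pass, second_pass])
```

In this sketch the encoder is never updated: the extra inference cost comes only from the second scoring pass and the fusion step, which is the sense in which such programs have no trainable parameters.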