Abstract
Minimizers are sampling schemes with numerous applications in computational biology. Assuming a fixed alphabet of size $\sigma$, a minimizer is defined by two integers $k,w\ge2$ and a linear order $\rho$ on strings of length $k$ (also called $k$-mers). A string is processed by a sliding window algorithm that chooses, in each window of length $w+k-1$, its minimal $k$-mer with respect to $\rho$. A key characteristic of the minimizer is its density, which is the expected frequency of chosen $k$-mers among all $k$-mers in a random infinite $\sigma$-ary string. Minimizers of smaller density are preferred as they produce smaller samples with the same guarantee: each window is represented by a $k$-mer. The problem of finding a minimizer of minimum density for given input parameters $(\sigma,k,w)$ has a huge search space of $(\sigma^k)!$ and is representable by an ILP of size $\tilde\Theta(\sigma^{k+w})$, which has worst-case solution time that is doubly-exponential in $(k+w)$ under standard complexity assumptions. We solve this problem in $w\cdot 2^{\sigma^k+O(k)}$ time and provide several additional tricks reducing the practical runtime and search space. As a by-product, we describe an algorithm computing the average density of a minimizer within the same time bound. Then we propose a novel method of studying minimizers via regular languages and show how to find, via the eigenvalue/eigenvector analysis over finite automata, minimizers with the minimal density in the asymptotic case $w\to\infty$. Implementing our algorithms, we compute the minimum density minimizers for $(\sigma,k)\in\{(2,2),(2,3),(2,4),(2,5),(4,2)\}$ and \textbf{all} $w\ge 2$. The obtained densities are compared against the average density and the theoretical lower bounds, including the new bound presented in this paper.