
Coresets for Robust Clustering via Black-box Reductions to Vanilla Case

Abstract

We devise $\epsilon$-coresets for robust $(k,z)$-Clustering with $m$ outliers through black-box reductions to the vanilla case. Given an $\epsilon$-coreset construction for vanilla clustering with size $N$, we construct coresets of size $N\cdot \mathrm{poly}\log(km\epsilon^{-1}) + O_z\left(\min\{km\epsilon^{-1}, m\epsilon^{-2z}\log^z(km\epsilon^{-1}) \}\right)$ for various metric spaces, where $O_z$ hides $2^{O(z\log z)}$ factors. This increases the size of the vanilla coreset by only a small multiplicative factor of $\mathrm{poly}\log(km\epsilon^{-1})$, and the additive term is within a $(\epsilon^{-1}\log (km))^{O(z)}$ factor of the size of the optimal robust coreset. Plugging in the vanilla coreset results of [Cohen-Addad et al., STOC'21], we obtain the first coresets for $(k,z)$-Clustering with $m$ outliers whose size is near-linear in $k$, whereas previous results have size at least $\Omega(k^2)$ [Huang et al., ICLR'23; Huang et al., SODA'25]. Technically, we establish two conditions under which a vanilla coreset is also a robust coreset. The first condition requires the dataset to have a special structure: it can be broken into "dense" parts with bounded diameter. We combine this with a new bounded-diameter decomposition that has only $O_z(km \epsilon^{-1})$ non-dense points to obtain the $O_z(km \epsilon^{-1})$ additive bound. The second condition requires the vanilla coreset to possess an extra size-preserving property. We further give a black-box reduction that turns a vanilla coreset into one satisfying this size-preserving property, leading to the alternative $O_z(m\epsilon^{-2z}\log^{z}(km\epsilon^{-1}))$ additive bound. We also implement our reductions in the dynamic streaming setting and obtain the first streaming algorithms for $k$-Median and $k$-Means with $m$ outliers, using space $\tilde{O}(k+m)\cdot\mathrm{poly}(d\epsilon^{-1}\log\Delta)$ for inputs on the grid $[\Delta]^d$.
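For readers unfamiliar with the objective, the robust $(k,z)$-Clustering cost with $m$ outliers is, in its standard form, the sum of $z$-th powers of distances to the nearest center after discarding the $m$ most expensive points. The sketch below evaluates that cost for an unweighted point set in Euclidean space. It is only an illustration of the objective, not of the paper's coreset construction; the function name `robust_cost` and the Euclidean, unweighted setting are assumptions made here.

```python
import numpy as np


def robust_cost(points, centers, m, z=2):
    """Robust (k,z)-clustering cost with m outliers (illustrative sketch).

    Assumes unweighted points in Euclidean space. For a fixed center set,
    the best outliers to discard are the m points with the largest cost.
    """
    points = np.asarray(points, dtype=float)
    centers = np.asarray(centers, dtype=float)

    # Pairwise Euclidean distances, shape (n_points, n_centers).
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)

    # Cost of each point: distance to its nearest center, raised to the power z.
    per_point = dists.min(axis=1) ** z

    n = per_point.shape[0]
    if m >= n:
        return 0.0  # every point may be discarded as an outlier

    # Keep the n - m cheapest points, i.e. drop the m most expensive ones.
    kept = np.sort(per_point)[: n - max(m, 0)]
    return float(kept.sum())
```

With $m = 0$ this reduces to the vanilla $(k,z)$-Clustering cost ($z = 1$ for $k$-Median, $z = 2$ for $k$-Means). An $\epsilon$-coreset is then a small weighted point set whose cost approximates this quantity within a $(1 \pm \epsilon)$ factor for every choice of centers.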
