Bridging the Semantic Gap for Categorical Data Clustering via Large Language Models

Abstract

arXiv:2601.01162v3 Announce Type: replace-cross Abstract: Qualitative data are widespread in domains such as healthcare, marketing, and bioinformatics, where clustering offers a fundamental tool for pattern discovery. A core difficulty of qualitative-data clustering lies in measuring similarity among attribute values that carry no inherent ordering or distance. To recover such relationships, existing studies typically rely on within-dataset co-occurrence statistics. This statistical route, however, becomes unreliable once the sample size is small, and the semantic context of each value is therefore left underexploited. Motivated by this limitation, this paper proposes BREVE (Balanced Representation via External Value Enrichment), a clustering framework that enriches each qualitative value with extra semantic dimensions drawn from an external knowledge base. That is, every unique value is expanded by a dense embedding that encodes its semantic content. To prevent the original value identity from being diluted by the added dimensions, a lightweight one-hot component is further appended. An adaptive weight, guided by cluster compactness, then determines how strongly the enrichment dimensions enter the final representation. With this design, experiments on eight benchmark datasets yield an average ARI rank of 1.3 against seven representative competitors.

Abstract

Related papers