Deepjoin: Joinable Table Discovery With Pre-trained Language Models
2022 · Yuyang Dong, Chuan Xiao, Takuma Nozawa, et al.
Abstract
Due to its usefulness in data enrichment for data analysis tasks, joinable table discovery has become an important operation in data lake management. Existing approaches target equi-joins, the most common way of combining tables to create a unified view, or semantic joins, which tolerate misspellings and different formats to deliver more join results. They are either exact solutions whose running time is linear in the sizes of the query column and the target table repository, or approximate solutions lacking precision. In this paper, we propose Deepjoin, a deep learning model for accurate and efficient joinable table discovery. Our solution is an embedding-based retrieval method, which employs a pre-trained language model (PLM) and is designed as one framework serving both equi- and semantic joins. We propose a set of contextualization options to transform column contents into a text sequence. The PLM reads the sequence and is fine-tuned to embed columns into vectors such that columns are expected to b
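The retrieval flow described in the abstract (serialize a column to text, embed it, search by vector similarity) can be illustrated with a minimal sketch under stated assumptions. The text pattern in `column_to_text` is a hypothetical contextualization, not necessarily one of the paper's options; `toy_embed` is a hashing stand-in for the fine-tuned PLM, not the model itself; and the brute-force scan in `top_k` stands in for whatever nearest-neighbor index a real system would use.

```python
import hashlib
import math
from typing import Dict, List, Tuple


def column_to_text(title: str, col_name: str, cells: List[str], max_cells: int = 50) -> str:
    """Serialize a column into one text sequence (hypothetical pattern)."""
    unique = list(dict.fromkeys(cells))  # drop duplicate cells, keep first-seen order
    shown = unique[:max_cells]           # cap the sequence length
    return f"{title}. {col_name} contains {len(unique)} values: {', '.join(shown)}"


def toy_embed(text: str, dim: int = 64) -> List[float]:
    """Stand-in for a PLM encoder: hash tokens into buckets, then L2-normalize."""
    vec = [0.0] * dim
    for ch in ",.:":
        text = text.replace(ch, " ")
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def top_k(query_vec: List[float], index: Dict[str, List[float]], k: int = 2) -> List[Tuple[str, float]]:
    """Brute-force search: rank indexed columns by cosine similarity to the query."""
    scored = [
        (name, sum(q * t for q, t in zip(query_vec, vec)))  # vectors are unit-length
        for name, vec in index.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Build a tiny in-memory "repository" of target columns (made-up data).
repository = {
    "cities.name": column_to_text("Cities", "name", ["Tokyo", "Osaka", "Nagoya"]),
    "sales.region": column_to_text("Sales", "region", ["tokyo", "osaka", "kyoto"]),
    "parts.sku": column_to_text("Parts", "sku", ["A-100", "B-200", "C-300"]),
}
index = {name: toy_embed(text) for name, text in repository.items()}

# Embed a query column the same way and retrieve its nearest neighbors.
query_text = column_to_text("Offices", "city", ["Tokyo", "Osaka"])
results = top_k(toy_embed(query_text), index, k=2)
```

In an actual system the stand-in encoder would be replaced by the fine-tuned language model, so that similarity in vector space reflects joinability rather than shared token hashes.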
Related papers
- Pylon: Semantic Table Union Search In Data Lakes (2023) 0.00
- PLUM: Adapting Pre-trained Language Models For Industrial-scale Generative Recommendations (2025) 2.26
- LLM-augmented Retrieval: Enhancing Retrieval Models Through Language Models And Doc-level Embedding (2024) 0.00
- Learning Language-visual Embedding For Movie Understanding With Natural-language (2016) 0.00
- CGPT: Cluster-guided Partial Tables With LLM-generated Supervision For Table Retrieval (2026) 1.57
- Transforming LLMs Into Cross-modal And Cross-lingual Retrieval Systems (2024) 4.52
- Training LLMs To Be Better Text Embedders Through Bidirectional Reconstruction (2025) 0.00
- Dataset Discovery In Data Lakes (2020) 15.28