ModelTables: A Corpus of Tables About Models
2025 · Zhengyuan Dong, Victor Zhong, Renée J. Miller
Abstract
We present ModelTables, a benchmark of tables in Model Lakes that captures the structured semantics of performance and configuration tables often overlooked by text-only retrieval. The corpus is built from Hugging Face model cards, GitHub READMEs, and referenced papers, linking each table to its surrounding model and publication context. Compared with open data lake tables, model tables are smaller yet exhibit denser inter-table relationships, reflecting tightly coupled model and benchmark evolution. The current release covers over 60K models and 90K tables. To evaluate model and table relatedness, we construct a multi-source ground truth using three complementary signals: (1) paper citation links, (2) explicit model card links and inheritance, and (3) shared training datasets. We present one extensive empirical use case for the benchmark: table search. We compare canonical Data Lake search operators (unionable, joinable, keyword) and Information Retrieval baselines (dense, spa
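The multi-source ground truth described in the abstract can be pictured with a small sketch. The abstract does not say how the three signals are combined, so the union used here is an assumption, and the function names (`related_pairs`, `build_ground_truth`), the model identifiers, and the toy data are all hypothetical. For simplicity the sketch treats every signal as a link between models, although citation links in the paper hold between publications.

```python
from itertools import combinations


def related_pairs(groups):
    """Return unordered pairs of members that share a group key.

    `groups` maps a key (here: a training dataset name) to the models
    associated with it. Hypothetical helper, not from the paper.
    """
    pairs = set()
    for members in groups.values():
        for a, b in combinations(sorted(set(members)), 2):
            pairs.add((a, b))
    return pairs


def build_ground_truth(citation_links, card_links, dataset_to_models):
    """Combine three relatedness signals by union (an assumption).

    Each related pair is mapped to the set of signals that support it,
    so pairs confirmed by several sources remain distinguishable.
    """
    signals = {
        "citation": {tuple(sorted(p)) for p in citation_links},
        "card_link": {tuple(sorted(p)) for p in card_links},
        "shared_dataset": related_pairs(dataset_to_models),
    }
    truth = {}
    for name, pairs in signals.items():
        for pair in pairs:
            truth.setdefault(pair, set()).add(name)
    return truth


# Toy inputs, invented purely for illustration.
citation_links = [("model-b", "model-a")]
card_links = [("model-a", "model-c")]
dataset_to_models = {"toy-dataset": ["model-a", "model-b"]}

ground_truth = build_ground_truth(citation_links, card_links, dataset_to_models)
for pair, support in sorted(ground_truth.items()):
    print(pair, sorted(support))
```

Keeping the supporting signals per pair, rather than a flat set of pairs, makes it possible to evaluate a search method against each signal separately or against their union.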
Related papers
- Pylon: Semantic Table Union Search in Data Lakes (2023) · 0.00
- Table2Vec: Neural Word and Entity Embeddings for Table Population and Retrieval (2019) · 13.55
- CGPT: Cluster-Guided Partial Tables with LLM-Generated Supervision for Table Retrieval (2026) · 1.57
- DeepJoin: Joinable Table Discovery with Pre-trained Language Models (2022) · 12.25
- StruBERT: Structure-aware BERT for Table Search and Matching (2022) · 10.97
- Dataset Discovery in Data Lakes (2020) · 15.28
- Multi-modal Retrieval of Tables and Texts Using Tri-encoder Models (2021) · 6.34
- Tabular Embedding Model (TEM): Finetuning Embedding Models for Tabular RAG Applications (2024) · 5.84