Learning To Tokenize For Generative Retrieval
2023 · Weiwei Sun, Lingyong Yan, Zheng Chen, et al.
Abstract
Conventional document retrieval techniques are mainly based on the index-retrieve paradigm. It is challenging to optimize pipelines based on this paradigm in an end-to-end manner. As an alternative, generative retrieval represents documents as identifiers (docid) and retrieves documents by generating docids, enabling end-to-end modeling of document retrieval tasks. However, it is an open question how one should define the document identifiers. Current approaches to the task of defining document identifiers rely on fixed rule-based docids, such as the title of a document or the result of clustering BERT embeddings, which often fail to capture the complete semantic information of a document. We propose GenRet, a document tokenization learning method to address the challenge of defining document identifiers for generative retrieval. GenRet learns to tokenize documents into short discrete representations (i.e., docids) via a discrete auto-encoding approach. Three components are included in
Related papers
- Bootstrapped Pre-training With Dynamic Identifier Prediction For Generative Retrieval (2024)
- Lightweight And Direct Document Relevance Optimization For Generative Information Retrieval (2025)
- Generative Retrieval As Multi-vector Dense Retrieval (2024)
- Generative Retrieval Meets Multi-graded Relevance (2024)
- GLEN: Generative Retrieval Via Lexical Index Learning (2023)
- CAT-ID\(^2\): Category-tree Integrated Document Identifier Learning For Generative Retrieval In E-commerce (2025)
- TokenRec: Learning To Tokenize ID For LLM-based Generative Recommendation (2024)
- Learning To Rank In Generative Retrieval (2023)