Abstract

Retrieval-enhanced language models (LMs), which condition their predictions on text retrieved from large external datastores, have recently shown significant perplexity improvements compared to standard LMs. One such approach, the \(k\)NN-LM, interpolates any existing LM's predictions with the output of a \(k\)-nearest neighbors model and requires no additional training. In this paper, we explore the importance of lexical and semantic matching in the context of items retrieved by the \(k\)NN-LM. We find two trends: (1) the presence of large overlapping \(n\)-grams between the datastore and evaluation set is an important factor in strong performance, even when the datastore is derived from the training data; and (2) the \(k\)NN-LM is most beneficial when retrieved items have high semantic similarity with the query. Based on our analysis, we define a new formulation of the \(k\)NN-LM that uses retrieval quality to assign the interpolation coefficient. We empirically measure the effectiven
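The interpolation described in the abstract can be illustrated with a minimal sketch. It assumes the standard \(k\)NN-LM form, in which the final distribution is `lam * p_knn + (1 - lam) * p_lm` and the \(k\)NN distribution is a softmax over negative neighbor distances aggregated per next token. The function names, the bucket thresholds, and the bucketed lookup for the coefficient are hypothetical choices made for illustration; the abstract only states that retrieval quality is used to assign the coefficient, not how.

```python
import math

def knn_distribution(neighbors, vocab_size):
    """Turn retrieved (distance, next_token_id) pairs into a distribution.

    Weights are a softmax over negative distances, summed per token.
    """
    if not neighbors:
        raise ValueError("at least one retrieved neighbor is required")
    min_d = min(d for d, _ in neighbors)  # shift for numerical stability
    weights = [(math.exp(-(d - min_d)), tok) for d, tok in neighbors]
    total = sum(w for w, _ in weights)
    probs = [0.0] * vocab_size
    for w, tok in weights:
        probs[tok] += w / total
    return probs

def interpolate(p_lm, p_knn, lam):
    """Mix the two distributions: lam * p_knn + (1 - lam) * p_lm."""
    return [lam * k + (1.0 - lam) * l for l, k in zip(p_lm, p_knn)]

def lambda_from_quality(top_similarity, buckets):
    """Pick the coefficient from a retrieval-quality score.

    `buckets` is a hypothetical list of (min_similarity, lam) pairs sorted
    by descending threshold; queries below every threshold get lam = 0.
    """
    for threshold, lam in buckets:
        if top_similarity >= threshold:
            return lam
    return 0.0

# Toy example with a 3-token vocabulary (all values are made up).
p_lm = [0.5, 0.3, 0.2]
neighbors = [(0.0, 1), (0.0, 1), (0.0, 2)]
buckets = [(0.8, 0.5), (0.5, 0.25)]

p_knn = knn_distribution(neighbors, vocab_size=3)
lam = lambda_from_quality(0.6, buckets)
p_final = interpolate(p_lm, p_knn, lam)
```

With a fixed coefficient, every query trusts retrieval equally; tying the coefficient to a quality score lets poorly matched queries fall back toward the base LM, which is the intuition behind trend (2) in the abstract.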

Authors

(none)

Tags

  • Image Retrieval

Stats

  • citations: 4
  • S2 citations: n/a
  • github stars: 0
  • HF likes: 0
  • heat score: 5.24
  • arxiv key: drozdov2022you

Related papers