Abstract

End-to-end (E2E) spoken language understanding (SLU) is constrained by the cost of collecting speech-semantics pairs, especially when label domains change. Hence, we explore \textit\{zero-shot\} E2E SLU, which learns E2E SLU without speech-semantics pairs, instead using only speech-text and text-semantics pairs. Previous work achieved zero-shot by pseudolabeling all speech-text transcripts with a natural language understanding (NLU) model learned on text-semantics corpora. However, this method requires the domains of speech-text and text-semantics to match, which often mismatch due to separate collections. Furthermore, using the entire collected speech-text corpus from any domains leads to \textit\{imbalance\} and \textit\{noise\} issues. To address these, we propose \textit\{cross-modal selective self-training\} (CMSST). CMSST tackles imbalance by clustering in a joint space of the three modalities (speech, text, and semantics) and handles label noise with a selection network. We also

Authors

(none)

Tags

  • Audio Understanding

Stats

  • citations0
  • S2 citationsβ€”
  • github stars9
  • HF likes0
  • heat score2.00
  • arxiv keyhe2023zero

Related papers