cluster #7
50 papers in this cluster (ordered by heat_score)
Papers
- VideoCLIP: Contrastive Pre-training For Zero-shot Video-text Understanding (2021). Hu Xu, Gargi Ghosh, Po-Yao Huang, et al. heat_score: 28.04
- Less Is More: ClipBERT For Video-and-language Learning Via Sparse Sampling (2021). Jie Lei, Linjie Li, Luowei Zhou, et al. heat_score: 25.76
- WIT: Wikipedia-based Image Text Dataset For Multimodal Multilingual Machine Learning (2021). Krishna Srinivasan, Karthik Raman, Jiecao Chen, et al. heat_score: 22.32
- Fine-tuned CLIP Models Are Efficient Video Learners (2022). Hanoona Rasheed, Muhammad Uzair Khattak, Muhammad Maaz, et al. heat_score: 21.57
- HowTo100M: Learning A Text-video Embedding By Watching Hundred Million Narrated Video Clips (2019). Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, et al. heat_score: 21.44
- Cap4Video: What Can Auxiliary Captions Do For Text-video Retrieval? (2022). Wenhao Wu, Haipeng Luo, Bo Fang, et al. heat_score: 20.22
- Multi-modal Transformer For Video Retrieval (2020). Valentin Gabeur, Chen Sun, Karteek Alahari, et al. heat_score: 19.47
- Self-supervised Video Representation Learning Using Inter-intra Contrastive Framework (2020). Li Tao, Xueting Wang, Toshihiko Yamasaki. heat_score: 18.58
- Fine-grained Video-text Retrieval With Hierarchical Graph Reasoning (2020). Shizhe Chen, Yida Zhao, Qin Jin, et al. heat_score: 18.27
- X-CLIP: End-to-end Multi-grained Contrastive Learning For Video-text Retrieval (2022). Yiwei Ma, Guohai Xu, Xiaoshuai Sun, et al. heat_score: 18.12
- Multilevel Language And Vision Integration For Text-to-clip Retrieval (2018). Huijuan Xu, Kun He, Bryan A. Plummer, et al. heat_score: 17.67
- Blazingly Fast Video Object Segmentation With Pixel-wise Metric Learning (2018). Yuhua Chen, Jordi Pont-Tuset, Alberto Montes, et al. heat_score: 17.46
- Revisiting Temporal Modeling For CLIP-based Image-to-video Knowledge Transferring (2023). Ruyang Liu, Jingjia Huang, Ge Li, et al. heat_score: 17.40
- Recipe1M+: A Dataset For Learning Cross-modal Embeddings For Cooking Recipes And Food Images (2018). Javier Marin, Aritro Biswas, Ferda Ofli, et al. heat_score: 17.24
- X-Pool: Cross-modal Language-video Attention For Text-video Retrieval (2022). Satya Krishna Gorti, Noel Vouitsis, Junwei Ma, et al. heat_score: 16.99
- T2VLAD: Global-local Sequence Alignment For Text-video Retrieval (2021). Xiaohan Wang, Linchao Zhu, Yi Yang. heat_score: 16.65
- VoP: Text-video Co-operative Prompt Tuning For Cross-modal Retrieval (2022). Siteng Huang, Biao Gong, Yulin Pan, et al. heat_score: 16.41
- CiCo: Domain-aware Sign Language Retrieval Via Cross-lingual Contrastive Learning (2023). Yiting Cheng, Fangyun Wei, Jianmin Bao, et al. heat_score: 16.35
- Audio Retrieval With Natural Language Queries: A Benchmark Study (2021). A. Sophia Koepke, Andreea-Maria Oncescu, João F. Henriques, et al. heat_score: 16.29
- Object-aware Video-language Pre-training For Retrieval (2021). Alex Jinpeng Wang, Yixiao Ge, Guanyu Cai, et al. heat_score: 16.14
- Dual Encoding For Video Retrieval By Text (2020). Jianfeng Dong, Xirong Li, Chaoxi Xu, et al. heat_score: 16.05
- HiT: Hierarchical Transformer With Momentum Contrast For Video-text Retrieval (2021). Song Liu, Haoqi Fan, Shengsheng Qian, et al. heat_score: 15.98
- TF-CLIP: Learning Text-free CLIP For Video-based Person Re-identification (2023). Chenyang Yu, Xuehu Liu, Yingquan Wang, et al. heat_score: 15.81
- TCLR: Temporal Contrastive Learning For Video Representation (2021). Ishan Dave, Rohit Gupta, Mamshad Nayeem Rizve, et al. heat_score: 15.78
- Everything At Once -- Multi-modal Fusion Transformer For Video Retrieval (2021). Nina Shvetsova, Brian Chen, Andrew Rouditchenko, et al. heat_score: 15.78
- Fine-grained Action Retrieval Through Multiple Parts-of-speech Embeddings (2019). Michael Wray, Diane Larlus, Gabriela Csurka, et al. heat_score: 15.62
- Tree-augmented Cross-modal Encoding For Complex-query Video Retrieval (2020). Xun Yang, Jianfeng Dong, Yixin Cao, et al. heat_score: 15.57
- Panda-70M: Captioning 70M Videos With Multiple Cross-modality Teachers (2024). Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, et al. heat_score: 15.54
- CenterCLIP: Token Clustering For Efficient Text-video Retrieval (2022). Shuai Zhao, Linchao Zhu, Xiaohan Wang, et al. heat_score: 15.54
- TS2-Net: Token Shift And Selection Transformer For Text-video Retrieval (2022). Yuqi Liu, Pengfei Xiong, Luhui Xu, et al. heat_score: 15.51
- UATVR: Uncertainty-adaptive Text-video Retrieval (2023). Bo Fang, Wenhao Wu, Chang Liu, et al. heat_score: 15.46
- TEACHTEXT: Crossmodal Generalized Distillation For Text-video Retrieval (2021). Ioana Croitoru, Simion-Vlad Bogolin, Marius Leordeanu, et al. heat_score: 15.43
- Bridging Video-text Retrieval With Multiple Choice Questions (2022). Yuying Ge, Yixiao Ge, Xihui Liu, et al. heat_score: 15.37
- DnS: Distill-and-select For Efficient And Accurate Video Indexing And Retrieval (2021). Giorgos Kordopatis-Zilos, Christos Tzelepis, Symeon Papadopoulos, et al. heat_score: 14.99
- End-to-end Cross-modality Retrieval With CCA Projections And Pairwise Ranking Loss (2017). Matthias Dorfer, Jan Schlüter, Andreu Vall, et al. heat_score: 14.68
- A Modulation Module For Multi-task Learning With Applications In Image Retrieval (2018). Xiangyun Zhao, Haoxiang Li, Xiaohui Shen, et al. heat_score: 14.58
- DGL: Dynamic Global-local Prompt Tuning For Text-video Retrieval (2024). Xiangpeng Yang, Linchao Zhu, Xiaohan Wang, et al. heat_score: 14.35
- Video Corpus Moment Retrieval With Contrastive Learning (2021). Hao Zhang, Aixin Sun, Wei Jing, et al. heat_score: 14.35
- Towards Balanced Alignment: Modal-enhanced Semantic Modeling For Video Moment Retrieval (2023). Zhihang Liu, Jun Li, Hongtao Xie, et al. heat_score: 14.33
- Clover: Towards A Unified Video-language Alignment And Fusion Model (2022). Jingjia Huang, Yinan Li, Jiashi Feng, et al. heat_score: 14.30
- Jointly Discovering Visual Objects And Spoken Words From Raw Sensory Input (2018). David Harwath, Adrià Recasens, Dídac Surís, et al. heat_score: 14.27
- Perfect Match: Improved Cross-modal Embeddings For Audio-visual Synchronisation (2018). Soo-Whan Chung, Joon Son Chung, Hong-Goo Kang. heat_score: 14.19
- Transformer Decoders With Multimodal Regularization For Cross-modal Food Retrieval (2022). Mustafa Shukor, Guillaume Couairon, Asya Grechka, et al. heat_score: 14.17
- Deep Cross-modal Correlation Learning For Audio And Lyrics In Music Retrieval (2017). Yi Yu, Suhua Tang, Francisco Raposo, et al. heat_score: 14.06
- Revamping Cross-modal Recipe Retrieval With Hierarchical Transformers And Self-supervised Learning (2021). Amaia Salvador, Erhan Gundogdu, Loris Bazzani, et al. heat_score: 13.97
- Reading-strategy Inspired Visual Representation Learning For Text-to-video Retrieval (2022). Jianfeng Dong, Yabing Wang, Xianke Chen, et al. heat_score: 13.93
- Prior Knowledge Integration Via LLM Encoding And Pseudo Event Regulation For Video Moment Retrieval (2024). Yiyang Jiang, Wengyu Zhang, Xulu Zhang, et al. heat_score: 13.83
- ViSiL: Fine-grained Spatio-temporal Video Similarity Learning (2019). Giorgos Kordopatis-Zilos, Symeon Papadopoulos, Ioannis Patras, et al. heat_score: 13.70
- TransVCL: Attention-enhanced Video Copy Localization Network With Flexible Supervision (2022). Sifeng He, Yue He, Minlong Lu, et al. heat_score: 13.47
- MCEN: Bridging Cross-modal Gap Between Cooking Recipes And Dish Images With Latent Variable Model (2020). Han Fu, Rui Wu, Chenghao Liu, et al. heat_score: 13.39