An Innovative Zero-Shot Inference Approach Based on Deep Learning Multi-modal Training

Authors: Zhuo Lei, Wei Li, Xiangwei Zhang, Qiang Yu, Lidan Shou, Shengquan Li, Yunqing Mao
Conference: ICIC 2024 Posters, Tianjin, China, August 5-8, 2024
Pages: 133-146
Keywords: Zero-shot Learning, Multi-modal, Object Detection, Transformer

Abstract

We present a novel multi-modal zero-shot inference framework for urban management applications, particularly in retail environments. The deep learning model fuses multi-scale CNN-based object detection with self-attention mechanisms to improve the identification of unauthorized activities and complex categorization tasks in fixed-point surveillance scenarios. Innovative components include lightweight channel aggregation modules that compress high-dimensional representations, multi-stage gate aggregation that captures intermediate interactions, and spatial aggregation that extracts context-aware multi-level features, addressing limitations of traditional DNNs. Attention down-sampling is integrated to mitigate the computational cost of applying Transformers to high-resolution imagery. Multi-modal learning bypasses explicit class labels by training directly on raw text-image pairs with contrastive learning, enabling the model to learn from natural language supervision and perform zero-shot recognition on unseen categories. We achieve state-of-the-art performance on both a public dataset and our own urban management dataset.
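The zero-shot step described above can be sketched as follows. This is a minimal illustration of contrastive-style zero-shot classification, assuming the jointly trained encoders have already produced an image embedding and one text embedding per candidate class prompt; the function name, class names, embedding dimension, and the `temperature` value are illustrative assumptions, not details from the paper.

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs, class_names, temperature=0.07):
    """Pick the class whose text embedding best matches the image embedding.

    image_emb: (d,) embedding of the query image.
    text_embs: (n, d) embeddings of natural-language prompts, one per class.
    In the full system these would come from the trained image/text encoders.
    """
    # L2-normalize so the dot product equals cosine similarity.
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = txt @ img / temperature            # (n,) similarity scores
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                        # softmax over classes
    return class_names[int(np.argmax(probs))], probs

# Toy example with hypothetical classes in a 4-d embedding space.
rng = np.random.default_rng(0)
texts = rng.normal(size=(3, 4))
image = texts[1] + 0.05 * rng.normal(size=4)    # image nearest to class 1
label, probs = zero_shot_classify(image, texts,
                                  ["stall", "unlicensed vendor", "litter"])
```

Because the class set enters only through the text prompts, new categories can be added at inference time without retraining, which is what makes the recognition zero-shot.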