An Innovative Zero-Shot Inference Approach Based on Deep Learning Multi-modal Training

Authors: Zhuo Lei, Wei Li, Xiangwei Zhang, Qiang Yu, Lidan Shou, Shengquan Li, Yunqing Mao
Conference: ICIC 2024 Posters, Tianjin, China, August 5-8, 2024
Pages: 133-146
Keywords: Zero-shot Learning, Multi-modal, Object Detection, Transformer

Abstract

We present a novel multi-modal zero-shot inference framework for urban management applications, particularly in retail environments. The deep learning model fuses multi-scale CNN-based object detection with self-attention mechanisms to improve the identification of unauthorized activities and complex categorization tasks in fixed-point surveillance scenarios. Innovative components include lightweight channel aggregation modules that compress high-dimensional representations, multi-stage gate aggregation that captures intermediate interactions, and spatial aggregation that extracts context-aware multi-level features, addressing limitations of traditional DNNs. Attention down-sampling is integrated to mitigate the computational cost of applying Transformers to high-resolution imagery. Multi-modal learning bypasses explicit class labels by training directly on raw text-image pairs with contrastive learning, enabling the model to learn from natural language supervision and perform zero-shot recognition on unseen categories. We achieve state-of-the-art performance on both a public dataset and our own urban management dataset.
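The zero-shot step described above can be sketched as follows. This is a minimal illustration of contrastive-style zero-shot classification, assuming the jointly trained encoders have already produced an image embedding and one text embedding per candidate class prompt; the function name, class names, embedding dimension, and the `temperature` value are illustrative assumptions, not details from the paper.

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs, class_names, temperature=0.07):
    """Pick the class whose text embedding best matches the image embedding.

    image_emb: (d,) embedding of the query image.
    text_embs: (n, d) embeddings of natural-language prompts, one per class.
    In the full system these would come from the trained image/text encoders.
    """
    # L2-normalize so the dot product equals cosine similarity.
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = txt @ img / temperature            # (n,) similarity scores
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                        # softmax over classes
    return class_names[int(np.argmax(probs))], probs

# Toy example with hypothetical classes in a 4-d embedding space.
rng = np.random.default_rng(0)
texts = rng.normal(size=(3, 4))
image = texts[1] + 0.05 * rng.normal(size=4)    # image nearest to class 1
label, probs = zero_shot_classify(image, texts,
                                  ["stall", "unlicensed vendor", "litter"])
```

Because the class set enters only through the text prompts, new categories can be added at inference time without retraining, which is what makes the recognition zero-shot.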