Multi-Scale Encoder and Temporal Queries Decoder for Video Object Detection

Authors: Hongxiao Yang and Xi Chen
Conference: ICIC 2025 Posters, Ningbo, China, July 26-29, 2025
Pages: 3090-3106
Keywords: video object detection, temporal queries, multi-scale features, transformer

Abstract

Video Object Detection (VOD) leverages temporal information across adjacent frames in video datasets, enabling identification and localization beyond what single-frame image object detection can achieve. Transformer-based detectors have achieved remarkable performance in static image object detection. However, their application to video object detection remains underexplored, particularly in aggregating spatial and temporal features effectively. Recent research has replaced handcrafted components in traditional optical flow models and association networks with novel designs that integrate spatial features across frames, thereby incorporating temporal information. Nevertheless, these methods often introduce significant computational overhead or complex processing pipelines. Moreover, integrating multi-scale spatial features and temporal features into a unified framework remains challenging, making it difficult to process both small and large objects simultaneously. To address these issues and enhance detection efficiency, we propose a novel method that aggregates multi-scale spatial features and contextual temporal information. Specifically, we propose a strip attention mechanism for intra-scale feature interaction, utilize a pyramid network to fuse spatial features across scales, and construct temporal associations across video frames through decoder structures. Our end-to-end approach aggregates target queries progressively from coarse to fine, striking a balance between performance and efficiency. Extensive experiments on the ImageNet VID dataset demonstrate that our method significantly improves video object detection.