Muti-Scale Encoder and Temporal Queries Decoder for Video Object Detection
Authors:
Hongxiao Yang and Xi Chen
Conference:
ICIC 2025 Posters, Ningbo, China, July 26-29, 2025
Pages:
3090-3106
Keywords:
video object detection, temporal queries, multi-scale features, transformer
Abstract
Video object detection (VOD) leverages temporal information across adjacent frames in video datasets, enabling identification and localization beyond what single-frame image object detection can achieve. Transformer-based detectors have achieved remarkable performance in static image object detection. However, their application to video object detection remains underexplored, particularly in aggregating spatial and temporal features effectively. Recent research has replaced handcrafted components in traditional optical flow models and association networks with novel designs to integrate spatial features across frames, thereby incorporating temporal information. Nevertheless, these methods often introduce significant computational overhead or complex processing pipelines. Moreover, integrating multi-scale spatial features and temporal features into a unified framework remains challenging, making it difficult to process both small and large objects simultaneously. To address these issues and enhance detection efficiency, we propose a novel method that aggregates multi-scale spatial features and contextual temporal information. Specifically, we propose a strip attention mechanism for intra-scale feature interaction, utilize a pyramid network to fuse spatial features across scales, and construct temporal associations across video frames through decoder structures. Our end-to-end approach aggregates target queries progressively from coarse to fine, striking a balance between performance and efficiency. Extensive experiments on the ImageNet VID dataset demonstrate that our method significantly improves video object detection.
BibTeX Citation:
@inproceedings{ICIC2025,
author = {Hongxiao Yang and Xi Chen},
title = {Muti-Scale Encoder and Temporal Queries Decoder for Video Object Detection},
booktitle = {Proceedings of the 21st International Conference on Intelligent Computing (ICIC 2025)},
month = {July},
date = {26-29},
year = {2025},
address = {Ningbo, China},
pages = {3090--3106},
}