PAF-DM: Proposal Alignment Framework for Multimodal Event Extraction via Dynamic Masking

Authors: Hengrui Song and Chun Yuan
Conference: ICIC 2025 Posters, Ningbo, China, July 26-29, 2025
Pages: 3354-3367
Keywords: Multimodal Event Extraction, Cross-modal Data Augmentation, Proposal, Dynamic Masking.

Abstract

Multimodal event extraction (MEE) aims to identify and classify event triggers and arguments by jointly modeling textual and visual information. While recent methods have shown promising progress, they often suffer from two key limitations: coarse-grained modality alignment and the scarcity of annotated multimodal data. To alleviate data scarcity, cross-modal data augmentation techniques—such as text-to-image and image-to-text generation—have been explored. However, synthetic data may introduce noise, including hallucinations and artifacts, which can negatively impact model performance. In this work, we propose PAF-DM, a novel framework for MEE that addresses both challenges through a proposal-based alignment paradigm and a dynamic masking strategy. Specifically, we incorporate a Q-Former architecture to achieve fine-grained, proposal-based alignment between event-related elements across modalities, and introduce a three-dimensional dynamic masking mechanism to reduce over-reliance on low-quality synthetic data. Experimental results on the M2E2 benchmark demonstrate that our approach achieves state-of-the-art performance and offers a robust solution for leveraging cross-modal data in MEE.
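The abstract does not specify the three masking dimensions or how quality is scored. The sketch below is only a minimal illustration of the general idea of dynamically masking low-quality synthetic data; the choice of token-, region-, and sample-level dimensions, the `quality` score, and `max_drop` are all assumptions for illustration, not the paper's actual mechanism.

```python
# Hypothetical sketch: dynamically mask synthetic-data features in
# proportion to an estimated quality score. The three dimensions shown
# (token, region, sample) are assumptions, not the paper's design.
import torch

def dynamic_mask(text_feats, img_feats, quality, max_drop=0.5):
    """Mask synthetic features with probability scaled by (1 - quality).

    text_feats: (B, T, D) token features from generated captions
    img_feats:  (B, R, D) region features from generated images
    quality:    (B,) per-sample quality scores in [0, 1]
    """
    drop_p = (1.0 - quality).clamp(0, 1) * max_drop            # (B,)
    # Token-level mask over synthetic text.
    tok_mask = (torch.rand(text_feats.shape[:2]) > drop_p[:, None]).float()
    # Region-level mask over synthetic image regions.
    reg_mask = (torch.rand(img_feats.shape[:2]) > drop_p[:, None]).float()
    # Sample-level mask: occasionally drop a synthetic pair entirely.
    smp_mask = (torch.rand(quality.shape) > drop_p).float()
    text_feats = text_feats * tok_mask[..., None] * smp_mask[:, None, None]
    img_feats = img_feats * reg_mask[..., None] * smp_mask[:, None, None]
    return text_feats, img_feats
```

Masking in proportion to estimated quality lets the model still benefit from imperfect synthetic pairs while limiting how much any single low-quality sample can dominate training.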