Multimodal Chinese Event Detection on Vision-Language Pre-training and Glyphs

Authors: Qianqian Si, Zhongqing Wang, Peifeng Li
Conference: ICIC 2024 Posters, Tianjin, China, August 5-8, 2024
Pages: 270-281
Keywords: VLP, Chinese glyphs, event detection

Abstract

When using visual information to complement textual data for event extraction, current approaches primarily process text and images independently with separate pre-trained unimodal models and then fuse the features from the different modalities. However, the pre-training and fine-tuning paradigm has been extended to the joint domain of vision and language, producing vision-language pre-trained models (VLPs) that are trained extensively on text paired with corresponding images and then fine-tuned for vision-language tasks. In this paper, we propose a method for Chinese event detection based on glyphs and VLP models. Since Chinese characters are pictographic in origin, the radical features of trigger words provide auxiliary cues for detecting textual triggers. We convert the text of the ACE Chinese corpus into text images and feed both the text and the images into a vision-language model to obtain multimodal features for event detection. Experimental results on the ACE 2005 Chinese corpus show that our proposed model outperforms the SOTA baselines.
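The abstract describes rendering sentences as text images so that glyph shapes are visible to a VLP's vision encoder, then combining text and image features for detection. Below is a minimal illustrative sketch of that idea, not the authors' exact pipeline: the Chinese-CLIP checkpoint, font path, fusion-by-concatenation, and classifier head are all assumptions made for the example.

```python
# Illustrative sketch: render Chinese text to an image (preserving glyph/radical
# shapes), encode text and image with a vision-language model, and fuse the
# features for event-type classification. Checkpoint, font, and head are
# assumptions, not the paper's specified architecture.
import torch
from PIL import Image, ImageDraw, ImageFont
from transformers import ChineseCLIPModel, ChineseCLIPProcessor


def render_text_image(text: str, size=(224, 224)) -> Image.Image:
    """Draw the sentence on a white canvas so character glyphs are visible."""
    img = Image.new("RGB", size, "white")
    draw = ImageDraw.Draw(img)
    # A CJK-capable font is required to render Chinese; the path is system-specific.
    font = ImageFont.truetype("NotoSansCJK-Regular.ttc", 20)
    draw.text((8, 8), text, font=font, fill="black")
    return img


ckpt = "OFA-Sys/chinese-clip-vit-base-patch16"  # one possible VLP checkpoint
model = ChineseCLIPModel.from_pretrained(ckpt)
processor = ChineseCLIPProcessor.from_pretrained(ckpt)

sentence = "他们在广场上爆发了冲突。"  # example sentence containing an event trigger
inputs = processor(
    text=[sentence],
    images=render_text_image(sentence),
    return_tensors="pt",
    padding=True,
)
with torch.no_grad():
    out = model(**inputs)

# Fuse the two modalities by concatenation; a trigger/event-type classifier
# sits on top. The label count (33 ACE subtypes + "None") is an assumption.
multimodal = torch.cat([out.text_embeds, out.image_embeds], dim=-1)
classifier = torch.nn.Linear(multimodal.size(-1), 34)
logits = classifier(multimodal)
```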