WIDDAS: a Word-Importance-Distribution-based Detection method against Word-Level Adversarial Sample

Authors: Xiangge Li, Hong Luo and Yan Sun
Conference: ICIC 2024 Posters, Tianjin, China, August 5-8, 2024
Pages: 70-86
Keywords: Natural Language Processing, Adversarial Samples, Textual Defense, Adversarial Detection, Model Robustness

Abstract

Deep neural networks face security threats from adversarial samples, and even the most advanced large language models remain vulnerable to adversarial attacks. Moreover, existing defense methods suffer from low detection accuracy, frequent false positives on clean data, and high defense costs. In this paper, we therefore propose WIDDAS: a Word-Importance-Distribution-based Detection method against Word-Level Adversarial Samples. It comprises a detection module and an evaluation module. The detection module swiftly identifies potential adversarial samples based on the word importance distribution of the input text. The evaluation module then attempts to restore those samples and evaluates whether they are actually adversarial, thereby filtering out clean (non-adversarial) data that was flagged by mistake. Experimental results demonstrate that WIDDAS outperforms the baselines in detection accuracy on both adversarial samples and clean data. In particular, on Chinese data its detection accuracy is at least 1.2% higher than that of the best baseline.
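The two-stage flow described in the abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from the paper: the bag-of-words classifier and its `WEIGHTS`, the leave-one-out importance score, the 0.6 concentration threshold, the `SYNONYMS` table, and the majority-vote restoration rule are all hypothetical stand-ins for whatever WIDDAS actually uses.

```python
import math
from typing import List

# Hypothetical toy victim model: per-word sentiment weights.
# "dreadful" is deliberately mis-weighted to mimic a model blind spot
# that a word-level synonym-substitution attack could exploit.
WEIGHTS = {"great": 2.0, "excellent": 2.0, "good": 1.5,
           "awful": -2.0, "terrible": -2.0, "dreadful": 0.6}

# Hypothetical synonym table used to "restore" a suspicious word.
SYNONYMS = {"awful": ["dreadful", "terrible"],
            "dreadful": ["awful", "terrible"],
            "terrible": ["awful", "dreadful"]}


def positive_prob(words: List[str]) -> float:
    """Probability of the positive class under the toy model."""
    score = sum(WEIGHTS.get(w, 0.0) for w in words)
    return 1.0 / (1.0 + math.exp(-score))


def word_importance(words: List[str]) -> List[float]:
    """Leave-one-out importance: change in output when a word is removed."""
    base = positive_prob(words)
    return [abs(base - positive_prob(words[:i] + words[i + 1:]))
            for i in range(len(words))]


def is_suspicious(words: List[str], threshold: float = 0.6) -> bool:
    """Stage 1 (assumed rule): flag inputs whose importance mass is
    concentrated on a single word."""
    imps = word_importance(words)
    total = sum(imps)
    return total > 0 and max(imps) / total >= threshold


def restore_and_evaluate(words: List[str]) -> bool:
    """Stage 2 (assumed rule): swap the most important word for each
    synonym; call the input adversarial if most swaps flip the label."""
    imps = word_importance(words)
    top = imps.index(max(imps))
    candidates = SYNONYMS.get(words[top], [])
    if not candidates:
        return False
    original_label = positive_prob(words) >= 0.5
    flips = sum(
        (positive_prob(words[:top] + [c] + words[top + 1:]) >= 0.5)
        != original_label
        for c in candidates
    )
    return flips > len(candidates) / 2


def detect(text: str) -> bool:
    """Run stage 1, then stage 2 only on inputs that stage 1 flagged."""
    words = text.lower().split()
    if not words:
        return False
    return is_suspicious(words) and restore_and_evaluate(words)
```

In this toy setting, `detect("the movie was dreadful")` flags the input, while `"the movie was awful"` is marked suspicious by the first stage but cleared by the second. That mirrors the abstract's point that the evaluation module exists to filter out clean data, although the real modules are presumably far more sophisticated.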