Hide Your Weaknesses from Attackers: A Defense Method against Black-Box Adversarial Text Attacks
Authors:
Kun Li; MinXuan Li; Wei Zhou
Conference:
ICIC 2024 Posters, Tianjin, China, August 5-8, 2024
Pages:
259-269
Keywords:
Natural language processing; Adversarial attack; Adversarial defense
Abstract
Despite the significant successes of LLMs in generative tasks, classification scenarios still predominantly rely on pre-trained language models (PLMs), given the balance between cost and effectiveness. However, these models are prone to manipulation through black-box adversarial text attacks, in which attackers modify texts in subtle ways to deceive the models. Typically, attackers follow a two-step process: first, they identify the sentence elements most important to the model, and then they alter those elements by replacing, deleting, or adding words or characters.
Previous research has mainly focused on creating adversarial examples for training, aiming to improve model resilience. These efforts largely overlook defenses against the initial phase of identifying vulnerable targets. This paper introduces a defensive strategy against this first-stage attack, leveraging concepts from differential privacy. We propose a novel approach, Mask Regeneration, which conceals the targets with the [MASK] token and employs a masked language model (MLM) to generate misleading samples. Additionally, we observe that key targets often align with high attention values in the model. Based on this insight, we introduce an Attention Shuffle tactic, which randomizes the top-k attention values at each transformer layer, further disorienting attackers.
Experiments show that our defense method achieves better robustness gains than the state of the art under three strong adversarial attacks on three typical NLP tasks: sentiment analysis, textual entailment, and topic classification. Moreover, we demonstrate that the attack cost increases significantly when attacking our defended model.
BibTeX Citation:
@inproceedings{ICIC2024,
author = {Kun Li and MinXuan Li and Wei Zhou},
title = {Hide Your Weaknesses from Attackers: A Defense Method against Black-Box Adversarial Text Attacks},
booktitle = {Proceedings of the 20th International Conference on Intelligent Computing (ICIC 2024)},
month = {August},
date = {5-8},
year = {2024},
address = {Tianjin, China},
pages = {259-269},
note = {Poster Volume II},
doi = {10.65286/icic.v20i2.98583}
}