GSM-AIV: Exposing the Fragile Boundaries of Mathematical Reasoning in LLMs through Contextual Recomposition

Authors: Hualing Liu, Zilong Zhang, Yingxin Hong, Yiwei Guo, Shiqin Gong, and Mengkai Wang
Conference: ICIC 2025 Posters, Ningbo, China, July 26-29, 2025
Pages: 1111-1121
Keywords: Large Language Model, Natural Language Processing, Mathematical Reasoning.

Abstract

This study reveals a semantic-context vulnerability of large language models in mathematical problem solving. We construct GSM-AIV, a dataset of algebraic isomorphic variants of GSM8K, in which the mathematical logic chain of each problem is retained while its surface context is changed. Under this condition, large language models such as Mistral-7B show an average accuracy improvement of 5.65 percentage points. That performance shifts with context alone implies that mathematical reasoning in LLMs relies heavily on surface semantic patterns rather than deep mathematical understanding. We further propose the template-induced bias and attention entropy reduction hypotheses to account for this loose coupling between the models' mathematical reasoning ability and the semantic scenario, offering a new theoretical perspective on the design of evaluation frameworks.
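The abstract does not detail how GSM-AIV is built, but the idea of "retaining the mathematical logic chain while changing the problem context" can be illustrated with a minimal Python sketch. Everything here (the recompose function, the entity mapping, and the sample problem) is a hypothetical illustration under the assumption that recomposition amounts to a consistent surface-entity substitution, not the authors' actual pipeline:

```python
import re

def recompose(problem: str, mapping: dict) -> str:
    """Rewrite the surface context of a word problem via consistent
    entity substitution; numerals are left untouched, so the underlying
    algebraic chain is preserved."""
    for old, new in mapping.items():
        problem = re.sub(rf"\b{re.escape(old)}\b", new, problem)
    return problem

original = ("Mia had 48 clips and gave 23 clips to her friend. "
            "How many clips does Mia have left?")
variant = recompose(original, {"Mia": "Tomas", "her": "his",
                               "clips": "coupons", "friend": "teammate"})

# The variant is algebraically isomorphic to the original: the numbers,
# and therefore the solution 48 - 23 = 25, are identical in both phrasings.
assert re.findall(r"\d+", original) == re.findall(r"\d+", variant)
print(variant)
```

Because the numeric structure is invariant by construction, any accuracy difference between the original problems and their variants can only come from the changed semantic scenario, which is the lever the paper uses to probe context sensitivity.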