LLM Evaluation Panel Selection Based on Cross Assessment and Similarity Matrix

Authors: Zhixiang Yang, Rongduo Han, Xiaoteng Pan, Liming Kang, Meiping Wang, Nan Gao, and Haining Zhang
Conference: ICIC 2025 Posters, Ningbo, China, July 26-29, 2025
Pages: 805-820
Keywords: Large language models, Model performance evaluation, Multi-model scoring, Cross Assessment, Similarity Matrix

Abstract

The evaluation of Large Language Models (LLMs) has become increasingly crucial as these models continue to advance in capability and complexity. Current evaluation methods rely primarily on lexical metrics and single-model scoring systems, which fall short of comprehensively and accurately assessing LLMs' semantic understanding and logical reasoning, posing a significant challenge for building reliable and trustworthy assessment frameworks. The contributions of this study are as follows. First, it introduces an automated approach that combines a cross-model evaluation mechanism with similarity analysis to systematically select members of a multi-model evaluation panel. Second, it validates the effectiveness of this methodology using expert-annotated evaluation data. Experimental results demonstrate that the multi-model evaluation panel achieves a noticeable improvement in scoring consistency with human evaluation compared to single-model approaches.
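
To make the panel-selection idea concrete, the following is a minimal, hypothetical sketch (not the authors' implementation, whose details are not given in the abstract): it assumes the cross-assessment step yields a matrix of scores each candidate model assigns to every other model's outputs, builds a similarity matrix from the correlation of those scoring behaviors, and selects a panel greedily to combine agreement with the overall consensus and diversity among evaluators.

```python
import numpy as np


def build_similarity_matrix(cross_scores: np.ndarray) -> np.ndarray:
    """Pairwise similarity between evaluator models.

    Assumption: cross_scores[i, j] is the average score model i assigns to
    responses produced by model j over a shared prompt set, so each row
    characterizes one model's behavior as an evaluator. Similarity is taken
    here as the Pearson correlation between rows.
    """
    return np.corrcoef(cross_scores)


def select_panel(similarity: np.ndarray, panel_size: int) -> list[int]:
    """Greedy panel selection (illustrative criterion, not from the paper).

    Seed the panel with the model whose scoring agrees most with the
    consensus, then repeatedly add the candidate whose scoring behavior is
    least redundant with the models already selected.
    """
    n = similarity.shape[0]
    consensus = similarity.mean(axis=1)          # average agreement with all models
    panel = [int(np.argmax(consensus))]          # most consensus-aligned seed
    while len(panel) < panel_size:
        remaining = [i for i in range(n) if i not in panel]
        # Redundancy = highest similarity to any model already on the panel.
        redundancy = [similarity[i, panel].max() for i in remaining]
        panel.append(remaining[int(np.argmin(redundancy))])
    return panel


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy cross-assessment matrix for 6 candidate evaluator LLMs (scores 1-5).
    cross_scores = rng.uniform(1, 5, size=(6, 6))
    sim = build_similarity_matrix(cross_scores)
    print("Selected panel:", select_panel(sim, panel_size=3))
```

The greedy diversity criterion above is only one plausible way to use a similarity matrix for panel membership; the paper's actual selection rule and its use of expert-annotated data should be taken from the full text.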