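This dataset contains 2,800 Spanish-language cultural knowledge points and 11,000 evaluation questions. Drawing on interdisciplinary research, it proposes a multi-level framework of dimensions for evaluating cultural competence. The goal is to assess, comprehensively and accurately, how well large language models master knowledge, recognize bias, and apply knowledge to situations within a specific cultural context.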

| Model / Accuracy | Overall | Objective questions | Subjective questions | Geography & Customs | Personal Choices & Habits | Regulation & Policy | Social Relationship & Structures | Values & Beliefs |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DeepSeek-R1 | 0.845 | 0.833 | 0.909 | 0.733 | 0.851 | 0.810 | 0.938 | 0.895 |
| Qwen2.5-14B | 0.870 | 0.882 | 0.826 | 0.883 | 0.851 | 0.840 | 0.958 | 0.908 |
| Llama3-8B | 0.800 | 0.845 | 0.636 | 0.783 | 0.786 | 0.760 | 0.938 | 0.829 |

| Model / Accuracy | Overall | Objective questions | Subjective questions | Geography & Customs | Personal Choices & Habits | Regulation & Policy | Social Relationship & Structures | Values & Beliefs |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DeepSeek-R1 | 0.664 | 0.526 | 0.857 | 0.810 | 0.629 | 0.586 | 0.729 | 0.737 |
| Qwen2.5-14B | 0.745 | 0.745 | 0.745 | 0.767 | 0.728 | 0.730 | 0.771 | 0.790 |
| Llama3-8B | 0.563 | 0.611 | 0.494 | 0.567 | 0.554 | 0.480 | 0.625 | 0.658 |
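
The tables report accuracy overall, by question type, and by cultural dimension. The scoring procedure is not described here, so the following is only a minimal sketch of how such a breakdown could be computed. It assumes each graded answer is available as a record with the hypothetical fields `dimension`, `question_type`, and `correct`; these names and the sample records are illustrative and are not taken from the dataset.

```python
from collections import defaultdict


def accuracy_by_group(records, key):
    """Return {group: fraction of correct answers} for the given grouping key."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for rec in records:
        group = rec[key]
        totals[group] += 1
        correct[group] += int(rec["correct"])
    return {group: correct[group] / totals[group] for group in totals}


# Hypothetical graded results; field names and values are assumptions
# made for illustration, not the dataset's actual schema.
records = [
    {"dimension": "Values & Beliefs", "question_type": "objective", "correct": True},
    {"dimension": "Values & Beliefs", "question_type": "subjective", "correct": False},
    {"dimension": "Regulation & Policy", "question_type": "objective", "correct": True},
    {"dimension": "Regulation & Policy", "question_type": "objective", "correct": True},
]

# Overall accuracy is the fraction of all records marked correct.
overall = sum(int(r["correct"]) for r in records) / len(records)
by_dimension = accuracy_by_group(records, "dimension")
by_type = accuracy_by_group(records, "question_type")
```

Grouping by a key passed as an argument lets the same helper produce both the question-type columns and the per-dimension columns. How subjective answers are judged correct or incorrect is outside the scope of this sketch.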