An Empirical Study on AI-Driven Evaluation of Student Translations Using Large Language Models
Online publication date: 2025-11-07
This study investigates the application of large language models (LLMs) in translation teaching, focusing on their effectiveness and limitations in assessing the quality of student translations. Drawing on established standards for human translation quality evaluation, it develops a two-tier analytical framework, quantitative analysis supported by qualitative analysis, that integrates human scores, LLM-generated scores, and evaluative comments. Quantitative results show that the LLM performed robustly on the structural dimensions of English-to-Chinese tasks, but its consistency declined markedly on the semantic and cultural dimensions of Chinese-to-English tasks, exposing weaknesses in deep semantic understanding and cultural adaptation. Qualitative analysis further reveals that the model-generated comments suffer from templated feedback, misattribution of errors, and rejection of creative translation, corroborating the quantitative findings. On this basis, the study proposes a human-AI collaborative pathway for teaching practice, positioning LLMs as auxiliary tools for structural checking while teachers lead semantic and cultural evaluation. The findings provide empirical support for the educational application of intelligent translation assessment and promote a shift in the role of LLMs from instrumental support to cognitive collaboration.
Zhang, J. & S. Peng. 2025. An empirical study on AI-driven evaluation of student translations using large language models[J]. Contemporary Foreign Languages Studies 25(5): 85-96. DOI: 10.3969/j.issn.1674-8921.2025.05.009
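The quantitative tier described in the abstract rests on comparing human and LLM scores dimension by dimension. A minimal sketch of such an agreement check is shown below; the score lists, 1-5 scale, and the particular measures (Pearson correlation plus exact and adjacent agreement rates) are illustrative assumptions, not the study's actual data or statistics:

```python
# Sketch of a human-vs-LLM score agreement check for one evaluation
# dimension. All scores below are invented illustration data.

def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def agreement_rate(xs, ys, tolerance=0):
    """Share of items where the two raters differ by at most `tolerance`."""
    hits = sum(1 for x, y in zip(xs, ys) if abs(x - y) <= tolerance)
    return hits / len(xs)

# Hypothetical per-student scores (1-5 scale) on one dimension.
human = [4, 3, 5, 2, 4, 3, 5, 4]
llm   = [4, 3, 4, 2, 5, 3, 5, 4]

print(f"Pearson r: {pearson(human, llm):.3f}")
print(f"Exact agreement: {agreement_rate(human, llm):.3f}")
print(f"Adjacent (±1) agreement: {agreement_rate(human, llm, 1):.3f}")
```

In a full analysis one would typically report an intraclass correlation coefficient (as in the reliability guidelines of Cicchetti 1994 and Koo & Li 2016) rather than Pearson's r alone, since ICC also penalizes systematic over- or under-scoring by the model.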