An Empirical Study on AI-Driven Evaluation of Student Translations Using Large Language Models

  • ZHANG Jing
  • PENG Sirui

Published online: 2025-11-07

Abstract

This study investigates the application of large language models (LLMs) in translation teaching, focusing on their effectiveness and limitations in assessing student translations. Drawing on established standards for human translation quality evaluation, the study developed a two-tier analytical framework that combines quantitative and qualitative analyses and incorporates human scores, LLM-generated scores, and evaluative comments. Quantitative results indicated that the LLM performed reliably on the structural dimensions of Chinese-to-English tasks but declined markedly on the semantic and cultural dimensions of English-to-Chinese tasks, exposing weaknesses in deep semantic understanding and cultural adaptation. Qualitative analysis further revealed issues such as templated feedback, misattribution of errors, and rejection of creative translation, corroborating the quantitative findings. Based on these results, the study proposes a human-AI collaborative pathway for teaching practice, positioning LLMs as auxiliary tools for structural checking while reserving semantic and cultural evaluation for teachers. The findings provide theoretical grounding and pedagogical implications for integrating intelligent assessment into translation education, and suggest a shift in the role of LLMs from instrumental support to cognitive collaboration.
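
The article does not reproduce its analysis scripts, so the following is only a minimal sketch of the kind of score-agreement computation the quantitative tier implies: comparing human scores with LLM-generated scores via a Pearson correlation and a two-way random-effects intraclass correlation, ICC(2,1). The score arrays, the helper icc2_1, and every numeric value are hypothetical placeholders for illustration, not the study's data or code.

```python
# Illustrative sketch (not the authors' code): agreement between human and
# LLM scores via Pearson r and ICC(2,1). All score values are hypothetical.
import numpy as np
from scipy.stats import pearsonr

def icc2_1(ratings: np.ndarray) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater.
    `ratings` is an (n_targets x k_raters) score matrix."""
    ratings = np.asarray(ratings, dtype=float)
    n, k = ratings.shape
    grand = ratings.mean()
    ss_rows = k * ((ratings.mean(axis=1) - grand) ** 2).sum()  # between translations
    ss_cols = n * ((ratings.mean(axis=0) - grand) ** 2).sum()  # between raters
    ss_err = ((ratings - grand) ** 2).sum() - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )

# Hypothetical 0-100 scores for ten student translations.
human = np.array([72, 85, 64, 90, 78, 58, 81, 69, 88, 75])
llm   = np.array([70, 83, 60, 92, 80, 55, 79, 72, 85, 74])

r, p = pearsonr(human, llm)
icc = icc2_1(np.column_stack([human, llm]))
print(f"Pearson r = {r:.3f} (p = {p:.4f}), ICC(2,1) = {icc:.3f}")
```

In an actual deployment, the per-dimension rubric scores (structural, semantic, cultural) for each student translation would replace the placeholder arrays, yielding one agreement estimate per dimension and per direction (Chinese-to-English vs. English-to-Chinese).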

Cite this article

ZHANG Jing, PENG Sirui. An Empirical Study on AI-Driven Evaluation of Student Translations Using Large Language Models[J]. Contemporary Foreign Languages Studies, 2025, 25(5): 85-96. DOI: 10.3969/j.issn.1674-8921.2025.05.009
