[1] Cicchetti, D. V. 1994. Guidelines, criteria, and rules of thumb for evaluating normed and standardized assessment instruments in psychology[J]. Psychological Assessment 6(4): 284-290.

[2] Colina, S. 2008. Translation quality evaluation: Empirical evidence for a functionalist approach[J/OL]. The Translator 14(1): 97-134. https://doi.org/10.1080/13556509.2008.10799251

[3] Colina, S. 2009. Further evidence for a functionalist approach to translation quality evaluation[J/OL]. Target 21(2): 235-264. https://doi.org/10.1075/target.21.2.02col

[4] Cui, Y. & M. Liang. 2024. Automated scoring of translations with BERT models: Chinese and English language case study[J]. Applied Sciences 14(5): 1925.

[5] Fernandes, P., D. Deutsch, M. Finkelstein, et al. 2023. The devil is in the errors: Leveraging large language models for fine-grained machine translation evaluation[R]. Singapore, Singapore. Proceedings of the Eighth Conference on Machine Translation (WMT): 1066-1083.

[6] Gong, M. 2025. The neural network algorithm-based quality assessment method for university English translation[J]. Network: Computation in Neural Systems 36(3): 649-661.

[7] Han, C. & X. Lu. 2023. Can automated machine translation evaluation metrics be used to assess students' interpretation in the language learning classroom?[J]. Computer Assisted Language Learning 36(5-6): 1064-1087.

[8] Han, C. 2025. Quality assessment in multilingual, multimodal, and multiagent translation and interpreting (QAM3 T&I): Proposing a unifying framework for research[J]. Interpreting and Society 5(1): 27-55.

[9] Huang, X., Z. Zhang, X. Geng, et al. 2024. Lost in the source language: How large language models evaluate the quality of machine translation[R]. Bangkok, Thailand. Findings of the Association for Computational Linguistics: ACL 2024: 3546-3562.

[10] Kocmi, T. & C. Federmann. 2023. Large language models are state-of-the-art evaluators of translation quality[R/OL]. Tampere, Finland. Proceedings of the 24th Annual Conference of the European Association for Machine Translation: 193-203. https://arxiv.org/abs/2302.14520

[11] Koo, T. K. & M. Y. Li. 2016. A guideline of selecting and reporting intraclass correlation coefficients for reliability research[J]. Journal of Chiropractic Medicine 15(2): 155-163.

[12] Lu, Q., B. Qiu, L. Ding, et al. 2023. Error analysis prompting enables human-like translation evaluation in large language models: A case study on ChatGPT[R]. Bangkok, Thailand. Findings of the Association for Computational Linguistics: ACL 2024: 8801-8816.

[13] Qian, S., A. Sindhujan, M. Kabra, et al. 2024. What do large language models need for machine translation evaluation?[R/OL]. Miami, Florida, USA. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: 3660-3674. https://arxiv.org/abs/2410.03278

[14] Williams, M. 2004. Translation Quality Assessment: An Argumentation-Centred Approach[M]. Ottawa: University of Ottawa Press.

[15] Yang, H., M. Zhang, et al. 2023. TeacherSim: Cross-lingual machine translation evaluation with monolingual embedding as teacher[R]. Pyeongchang, South Korea. 2023 25th International Conference on Advanced Communication Technology (ICACT): 283-287.

[16] 郜洁. 2025. The double-edged sword effect of generative artificial intelligence: An analysis of the application advantages and potential risks of DeepSeek in foreign language education[J]. 当代外语研究 (3): 140-151.

[17] 江进林. 2013. Automated quantification of language quality in English-Chinese translation[J]. 现代外语 36(1): 85-91, 110.

[18] 江进林、文秋芳. 2012. Constructing a machine scoring model for students' English-Chinese translation in large-scale testing[J]. 外语电化教学 (2): 3-8.

[19] 李晶洁、陈秋燕. 2025. The evolution of human-machine collaborative intelligent writing: Analysis and implications[J]. 当代外语研究 (1): 73-83.

[20] 王金铨. 2008. Research on and construction of a computer-assisted scoring model for Chinese students' Chinese-English translation[D]. 北京: 北京外国语大学.

[21] 王金铨、朱周晔. 2017. Automated assessment of Chinese-English translation competence[J]. 中国外语 14(2): 66-71.

[22] 王巍巍、王轲、张昱琪. 2022. Exploring approaches to automated interpreting scoring based on the CSE interpreting scales[J]. 外语界 (2): 80-87.

[23] 袁煜. 2016. Feature sets for automated translation quality assessment[J]. 外语教学与研究 48(5): 776-787, 801.

[24] 张静. 2024. Constructing a teaching model for higher-order thinking in translation in the context of generative artificial intelligence[J]. 中国翻译 (3): 71-80.