Korean AI Models Score Far Below Global Rivals on Math Tests

Large language models (LLMs) developed by Korean teams competing in the nation's flagship artificial intelligence initiative have performed significantly worse than foreign models on college entrance math and essay problems, a new study shows.
According to IT industry sources on Wednesday, a research team led by Professor Kim Jong-rak of Sogang University's mathematics department tested major LLMs from five Korean teams competing in the national AI challenge, along with five foreign models including ChatGPT, on 20 math problems from Korea's College Scholastic Ability Test (CSAT) and 30 essay questions.
For the CSAT portion, researchers selected the five most difficult questions from each of four sections: common subjects, probability and statistics, calculus, and geometry. The 30 essay questions consisted of 10 problems from past exams at 10 Korean universities, 10 from Indian university entrance exams, and 10 from the entrance exam for the University of Tokyo's engineering graduate school.
The Korean models tested were Upstage's Solar Pro-2, LG AI Research's Exaone 4.0.1, Naver's HCX-007, SK Telecom's A.X 4.0 (72B), and NCSoft's lightweight model Llama Varco 8B Instruct. Foreign models included GPT-5.1, Gemini 3 Pro Preview, Claude Opus 4.5, Grok 4.1 Fast, and DeepSeek V3.2.
On the CSAT problems, foreign models scored between 76 and 92 points. Among the Korean models, only Solar Pro-2 came close, with 58 points; the rest scored in the low 20s, with Llama Varco 8B Instruct posting the lowest score, 2 points.
The research team noted that the five Korean models were set up to use Python as a tool to improve problem-solving accuracy, since they could not solve most problems through reasoning alone, and that even with this aid they produced the scores above.
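To make that setup concrete, here is a minimal sketch of how such Python tool use is typically wired up: the model is prompted to emit a short program, a harness executes it, and the printed value is graded against the answer key. The harness, prompt, and function names are illustrative assumptions, not the study's actual evaluation code.

```python
# Illustrative tool-use harness, NOT the research team's pipeline:
# the model writes Python, the harness runs it and grades the output.
import subprocess
import sys

def solve_with_python_tool(model_generate, problem: str, answer_key: str) -> bool:
    """Ask the model for Python code, run it, and grade the printed answer."""
    prompt = (
        "Solve the following math problem by writing a short Python "
        "program that prints only the final answer.\n\n" + problem
    )
    code = model_generate(prompt)  # any callable returning a code string
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, timeout=30,
    )
    return result.stdout.strip() == answer_key.strip()
```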
The researchers also tested the 10 models on 10 problems from their proprietary problem set called "EntropyMath," which contains 100 questions with difficulty levels ranging from undergraduate to professor-level research. Foreign models scored between 82.8 and 90 points, while Korean models scored between 7.1 and 53.3 points.
Under a methodology that credited a model if any of three attempts produced the correct answer, Grok achieved a perfect score and the other foreign models each scored 90 points. Among Korean models, Solar Pro-2 scored 70 points, Exaone 60, HCX-007 40, A.X 4.0 30, and Llama Varco 8B Instruct 20.
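For readers unfamiliar with this kind of multi-attempt scoring, the sketch below shows one plausible way to compute it. The function name, data layout, and grading rule (exact string match) are assumptions for illustration, not the team's actual grading code.

```python
# Multi-attempt ("best of k") grading: a problem counts as solved if any
# of the first k recorded answers matches the key. Illustrative only.
def pass_at_k_score(attempts_per_problem, answer_key, k=3):
    """Return the percentage of problems solved within the first k attempts."""
    solved = 0
    for attempts, correct in zip(attempts_per_problem, answer_key):
        if any(a.strip() == correct.strip() for a in attempts[:k]):
            solved += 1
    return 100.0 * solved / len(answer_key)

# Example: 10 problems, each with up to three recorded answers.
attempts = [["41", "42", "x"]] * 7 + [["0", "1", "2"]] * 3
key = ["42"] * 10
print(pass_at_k_score(attempts, key))  # 70.0
```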
"There were many inquiries about why there was no evaluation of the five domestic sovereign AI models on CSAT problems, so we conducted tests with our team members," Professor Kim said. "We could see that domestic models are significantly behind foreign frontier models."
An industry official noted that "many domestic LLMs are not equipped with a reasoning mode," adding that "the comparison criteria do not match, making it difficult to generalize the results." In other words, while reasoning capability is essential for solving math problems, only some of the models tested have a reasoning mode, which makes direct comparison problematic.
The research team plans to retest performance on its proprietary problems when each team releases a new version of its national AI model, as the five Korean models used in this study were the existing publicly available versions.
"We have established a math leaderboard based on the EntropyMath dataset and will develop it to an international level," Professor Kim said. "We will improve our proprietary problem generation algorithms and pipelines to create datasets not only for mathematics but also for science, manufacturing, and cultural domains to contribute to improving domain-specific model performance."
