
KT has developed a multilingual benchmark that tests whether artificial intelligence (AI) properly understands each country's cultural context and social norms.
KT said Wednesday that, together with global companies, public institutions, and academia, it has unveiled "XL-SafetyBench," a benchmark that comprehensively evaluates the safety and cultural sensitivity of large language models (LLMs). XL-SafetyBench is a multilingual dataset of 5,500 prompts that reflects the linguistic and cultural characteristics of 10 countries, including South Korea, the United States, Germany, Japan, Türkiye, and the United Arab Emirates (UAE). It measures how appropriately LLMs recognize and reflect each country's social norms and cultural sensitivities. In particular, it incorporates cases in which the same expression or object carries entirely different meanings depending on the culture, so that it can precisely test both a model's safety and its cultural awareness.
Experts from global companies, public institutions, and academia participated in designing the benchmark. AI security company Aim Intelligence handled research tasks such as building data that reflects actual attack patterns and designing the review process. Microsoft (MS), drawing on its experience with global AI services, made the case for evaluating safety and cultural sensitivity across diverse cultural and linguistic environments. The Korea AI Safety Institute (AISI) proposed evaluation perspectives that reflect the laws, systems, and cultural characteristics of each country. A total of 17 experts affiliated with 10 institutions, including academic institutions such as the Technical University of Munich in Germany, Ankara University in Türkiye, and Seoul National University, also participated in designing the benchmark. For this research, KT drew on the capabilities of its dedicated Responsible AI (RAI) organization, which establishes standards for AI safety and reliability, builds evaluation systems, and develops risk mitigation technologies.
The benchmark's evaluation code has been released on Hugging Face, a platform for sharing AI models and data, and on GitHub, an open-source development collaboration platform, so that anyone can use it. The research team also used the benchmark to evaluate 37 major LLMs and published a paper with the analysis results on the preprint repository arXiv.






