
Kakao's AI 'Kanana-o' Outperforms ChatGPT in Korean Language Tasks

By Kim Heung-rok

Kakao's (035720.KS) integrated multimodal language model "Kanana-o" has surpassed OpenAI's GPT-4o in Korean speech recognition and reasoning, the company said. The improvement follows continuous performance enhancements since the model's initial launch.

Kakao on Monday unveiled the development process and performance results of its multimodal language model Kanana-o and multimodal embedding model "Kanana-v-embedding" through its tech blog. According to benchmark evaluations disclosed by the company, Kanana-o showed similar performance to GPT-4o in English speech capabilities, while demonstrating significantly higher performance in Korean speech recognition, synthesis, and emotion recognition.

Kakao first introduced Kanana-o in May as an integrated multimodal language model capable of simultaneously understanding text, voice, and images while providing real-time responses. The company noted that existing multimodal models perform well with text input but tend to provide simpler responses with reduced reasoning ability during voice conversations. Kakao focused on improving instruction-following capabilities in Kanana-o to address this limitation.

The company explained that it trained the model with high-quality voice data, precisely teaching intonation, emotion, and breathing patterns. This enhanced the model's ability to express vivid emotions such as joy, sadness, anger, and fear depending on the situation, and to convey emotion through subtle changes in tone and timbre.

"We plan to evolve the model to enable more natural full-duplex conversations and real-time generation of context-appropriate soundscapes," Kakao said.

Kanana-v-embedding, also unveiled Monday, is a Korean multimodal model capable of understanding and processing text and images simultaneously. It supports searching for images using text, retrieving information related to user-selected images, and searching documents containing images. Kakao said the model excels at understanding Korean proper nouns such as Gyeongbokgung Palace and bungeoppang, a traditional Korean fish-shaped pastry.
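Kakao has not published Kanana-v-embedding's API, but text-to-image search of this kind typically works by mapping both queries and images into a shared vector space and ranking images by cosine similarity to the query embedding. The minimal sketch below illustrates that retrieval step with hypothetical, hand-made embedding vectors standing in for real model output:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def search_images(query_embedding: np.ndarray, image_embeddings: list) -> list:
    """Return image indices ranked by similarity to the text query."""
    scores = [cosine_similarity(query_embedding, e) for e in image_embeddings]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Toy 4-dimensional embeddings; a real model would produce these from
# the text query and the candidate images.
query = np.array([1.0, 0.0, 1.0, 0.0])   # e.g. the query "Gyeongbokgung Palace"
images = [
    np.array([0.9, 0.1, 0.8, 0.0]),      # a palace photo (similar direction)
    np.array([0.0, 1.0, 0.0, 1.0]),      # an unrelated photo (orthogonal)
]
ranking = search_images(query, images)
print(ranking)  # the palace photo is ranked first: [0, 1]
```

The same similarity scoring underlies the other uses Kakao describes, such as comparing advertising materials: two images whose embeddings point in nearby directions are treated as similar content.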

Kakao is currently applying Kanana-v-embedding internally to systems that analyze and review the similarity of advertising materials. The company plans to expand the scope to video and audio for application in more diverse services.

"We will focus on creating AI technology experiences in users' daily lives through actual service environments and implementing AI that can interact like humans," said Kim Byung-hak, Kanana Performance Leader at Kakao.

Kakao is also conducting research on lightweight multimodal models that can operate in on-device environments such as mobile devices. The company is preparing to develop "Kanana-2," a high-performance, high-efficiency model.