Naver Replaces Chinese Component with In-House Vision Encoder

Dropping China's Qwen in Favor of a Proprietary Vision Encoder · Multimodal Architecture Tied to Korean Language and Culture to Accelerate Sovereign AI

Seoul Economic Daily · Technology

By Lee Jin-seok

Naver Cloud logo. Photo provided by Naver Cloud

Naver (035420.KS) is completely removing the Chinese vision encoder that sparked controversy during its independent artificial intelligence foundation model development project, and will apply its own vision encoder across all of its AI models.

According to industry sources on Wednesday, Naver Cloud completed development of its proprietary vision encoder early last month and has begun work to integrate it into all multimodal models it develops going forward.

A vision encoder is a module that converts image and video information into a form that AI can understand. In multimodal models that comprehensively handle text, image, audio, and video information, it serves as a kind of "optic nerve."
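To make the "optic nerve" idea concrete, the following is a minimal sketch of what such a module does in principle: it cuts an image into small patches and turns each patch into a vector of numbers that a language model can consume. It is an illustration only and does not describe Naver's encoder. The function name `encode_image`, the patch size, and the fixed random projection (standing in for learned weights) are all assumptions made for this example.

```python
import random

def encode_image(pixels, patch=2, dim=4, seed=0):
    """Toy vision encoder: split a grayscale image into patch x patch tiles
    and project each flattened tile to a dim-dimensional vector."""
    rng = random.Random(seed)
    # Assumption: a fixed random projection stands in for trained weights.
    weights = [[rng.uniform(-1, 1) for _ in range(patch * patch)]
               for _ in range(dim)]
    height, width = len(pixels), len(pixels[0])
    tokens = []
    for top in range(0, height - patch + 1, patch):
        for left in range(0, width - patch + 1, patch):
            # Flatten one tile into a list of pixel values.
            tile = [pixels[top + r][left + c]
                    for r in range(patch) for c in range(patch)]
            # One output vector ("visual token") per tile.
            tokens.append([sum(w * x for w, x in zip(row, tile))
                           for row in weights])
    return tokens

# A hypothetical 4x4 grayscale image with values between 0 and 1.
image = [[0.0, 0.1, 0.2, 0.3],
         [0.4, 0.5, 0.6, 0.7],
         [0.8, 0.9, 1.0, 0.9],
         [0.8, 0.7, 0.6, 0.5]]

tokens = encode_image(image)
print(len(tokens), len(tokens[0]))  # 4 tiles, each encoded as 4 numbers
```

A production encoder replaces the random projection with many trained layers, but the contract is the same: pixels go in, a sequence of vectors comes out.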

The newly developed vision encoder from Naver Cloud is a substantial performance improvement over the company's existing in-house encoder technology, "VUClip." It reportedly performs on par with top-tier encoders such as the one in China's Qwen, whose usability has been verified in the open-source ecosystem.

Earlier this year, Naver Cloud faced strong criticism for partially borrowing the vision encoder and weights of Alibaba's Qwen 2.5 model for its multimodal model "HyperCLOVA X SEED 32B Think" while participating in the government-led independent foundation model project. Critics argued that this ran counter to the intent of the project, which calls for a "From Scratch" principle of building with proprietary technology from the initial training stage.

At the time, Naver Cloud explained that "the vision encoder can be replaced at any time, and it is not a core area that cannot be replaced."

However, whether to replace the encoder in "HyperCLOVA X SEED 32B Think," which has already been released as open source, remains undecided.

The key feature of the newly developed vision encoder is that it is trained on Korean from the outset, linking images directly to the Korean language without a separate translation step. For example, where existing global encoders were limited to labeling an image of Korea's "Harubang" with the generic English word "Statue," Naver's model recognizes the image directly as the Korean word "Harubang."
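The difference can be sketched with a toy image-to-text matching step, in which an image vector is compared against text-label vectors and the closest label wins. This is a simplified illustration, not Naver's method: the report does not disclose how the encoder is trained, and every vector and label below is hand-made and hypothetical. The point is only that a model can name a culturally specific object if that name exists in the label space it was trained to align with.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_label(image_vec, label_vecs):
    """Return the label whose vector is most similar to the image vector."""
    return max(label_vecs, key=lambda label: cosine(image_vec, label_vecs[label]))

# Hypothetical embedding of a photo of the statue described in the article.
image_vec = [0.9, 0.8, 0.1]

# Assumption: an encoder aligned only with generic English labels.
generic_labels = {
    "statue": [1.0, 0.2, 0.0],
    "tree": [0.0, 0.1, 1.0],
}

# Assumption: an encoder aligned directly with Korean labels,
# including the specific term as well as the generic ones.
korean_labels = {
    "하르방": [0.9, 0.9, 0.1],  # "Harubang"
    "석상": [1.0, 0.2, 0.0],    # generic "stone statue"
    "나무": [0.0, 0.1, 1.0],    # "tree"
}

print(nearest_label(image_vec, generic_labels))  # statue
print(nearest_label(image_vec, korean_labels))   # 하르방
```

With only generic labels available, the best match is "statue"; once the specific Korean term is part of the aligned vocabulary, the same image resolves to it without passing through an English intermediate.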

Because it is designed to connect the Korean language and images directly, the model's differentiating factor is its ability to read the unique context of Korean culture without distorting information. Naver Cloud plans to use the model to strengthen its "Sovereign AI" strategy, aiming for higher accuracy than foreign models on visual data involving Korean geography, culture, and proper nouns.

AI-translated from Korean. Quotes from foreign sources are based on Korean-language reports and may not reflect exact original wording.
