
South Korea is launching a project to convert labeled data built for discriminative AI into training data for generative AI, keeping pace with rapidly evolving artificial intelligence technology.
The Ministry of Science and ICT (MSIT) and the National Information Society Agency announced Tuesday that the public notice for the "AI Training Data Upcycling" project will open on the 30th of this month. The project reprocesses AI training data previously provided through the AI Hub to fit the latest technology environment.
The project will convert a total of 30 datasets, 15 each in the large language model (LLM) and physical AI fields, with a total investment of 3 billion won ($2.2 million). The government judged that reprocessing existing training data delivers greater policy impact per unit of budget than building new datasets from scratch.
To select the datasets for reprocessing, the government reviewed all 691 datasets built in the AI Hub through 2022 and identified those with the highest potential for expansion into generative AI data and the greatest utilization value.

In the LLM data field, existing text data will be restructured to include the full reasoning process, from the question through evidence review and error verification to answer confirmation.
The government plans to expand the data beyond presenting a single correct answer so that models can learn diverse judgment paths and self-verification processes. In particular, it will build a training foundation for reasoning-type AI capable of solving complex problems by constructing multiple reasoning paths for the same question and incorporating evidence-based judgment and error correction.
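As a rough illustration of what such a restructured record might look like, the Python sketch below pairs one question with several reasoning paths, each carrying its own evidence review, error verification, and answer. The government has not published a schema, so every class and field name here is a hypothetical assumption, not the project's actual format.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ReasoningPath:
    """One route from question to answer (hypothetical field names)."""
    evidence_review: str      # what evidence was examined
    error_verification: str   # how the intermediate result was checked
    answer: str               # the answer this path arrives at


@dataclass
class ReasoningRecord:
    """A question paired with several independent reasoning paths."""
    question: str
    paths: List[ReasoningPath] = field(default_factory=list)

    def confirmed_answer(self) -> Optional[str]:
        """Return the answer only if every path agrees; otherwise None."""
        answers = {p.answer for p in self.paths}
        return next(iter(answers)) if len(answers) == 1 else None


# Illustrative record: two different paths to the same answer.
record = ReasoningRecord(
    question="What is 17 + 26?",
    paths=[
        ReasoningPath(
            evidence_review="Tens: 10 + 20 = 30. Ones: 7 + 6 = 13.",
            error_verification="30 + 13 = 43, and 43 - 26 = 17.",
            answer="43",
        ),
        ReasoningPath(
            evidence_review="Rewrite as 17 + 30 - 4.",
            error_verification="47 - 4 = 43, matching the first path.",
            answer="43",
        ),
    ],
)

print(record.confirmed_answer())
```

Treating agreement across paths as the confirmation step is one simple design choice; a real dataset could equally record a human-verified answer alongside the paths.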
In the physical AI field, existing image and video data will be upgraded into a structure integrating visual information (V), language commands (L), and action and control (A).
This will expand the data beyond object recognition into a form that captures situational changes over time and interactions between objects, and that supports generating goal-based actions. In particular, the data will be restructured to define action paths and task objectives by leveraging continuous scene information and object movement data.
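The V-L-A structure described above can be sketched in Python as follows. The example derives an action path from the frame-to-frame movement of a tracked object and takes the object's final position as the task objective. The record layout, the function name build_episode, and the use of 2D integer coordinates are all hypothetical assumptions made for illustration; the article does not specify the actual data format.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[int, int]


@dataclass
class VLAStep:
    frame_id: str   # V: reference to one frame of the source video
    action: Point   # A: displacement that leads to the next frame


@dataclass
class VLAEpisode:
    instruction: str      # L: natural-language command
    goal: Point           # task objective, here the final object position
    steps: List[VLAStep]  # action path ordered in time


def build_episode(instruction: str,
                  frame_ids: List[str],
                  track: List[Point]) -> VLAEpisode:
    """Turn a per-frame object track into an instruction-conditioned episode."""
    if len(frame_ids) != len(track) or len(track) < 2:
        raise ValueError("need one position per frame and at least two frames")
    # Each action is the object's displacement between consecutive frames;
    # the last frame has no following frame, so it gets no action.
    steps = [
        VLAStep(fid, (x2 - x1, y2 - y1))
        for fid, (x1, y1), (x2, y2) in zip(frame_ids, track, track[1:])
    ]
    return VLAEpisode(instruction=instruction, goal=track[-1], steps=steps)


episode = build_episode(
    "Move the cup to the shelf.",
    ["f0", "f1", "f2"],
    [(0, 0), (1, 0), (1, 2)],
)

for step in episode.steps:
    print(step.frame_id, step.action)
```

A production dataset would likely store richer control signals than a 2D displacement, but the same pairing of frames, a command, and a time-ordered action sequence would apply.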
The upcycled data will later be released through the AI Hub for free use by companies, research institutions, and startups.
"Through this upcycling project, we will be able to secure AI training data suited to the latest generative AI technology environment at low cost," Choi Dong-won, director general for AI infrastructure policy at the MSIT, said Tuesday. "We will raise the utilization value of already accumulated data assets so that they are not wasted."




