Hancom's Open Data Loader PDF v2.0 Tops GitHub Trending

By Lee Jin-seok
Seoul Economic Daily

Hancom (030520) announced on June 23 that its open-source PDF data extraction project, Open Data Loader PDF v2.0, ranked No. 1 in GitHub's trending list across all programming languages on June 20, earning a trending badge from the world's largest open-source development platform.

The project achieved the milestone just one week after its release.

GitHub Trending is a real-time index tracking the open-source projects drawing the most attention from developers worldwide. Open Data Loader PDF v2.0 recorded more than 1,800 new GitHub stars on June 21 alone, surpassing 7,000 total stars and 500 forks (copies of the repository that other developers make for independent use).

The growth rate far exceeds typical open-source trajectories, placing the project on par with globally top-ranked repositories. GitHub stars are widely used in the global developer community as a gauge of a project's usefulness, recognition, and credibility.

Open Data Loader PDF is a tool that decomposes complex PDF documents into text, tables, and images, converting them into formats that artificial intelligence systems can process directly. PDF is the most widely used document format for AI training globally, but its complex internal structure has made data extraction difficult, creating a major bottleneck in AI development. Hancom signed a memorandum of understanding with DualLab, a global PDF technology specialist, in July last year and began joint development. The company released an initial version in September and launched v2.0 on June 12.
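
For readers who want a concrete picture of what "formats that AI systems can process directly" can mean, the following is a minimal Python sketch. It is not Open Data Loader PDF's actual API or output schema, which this article does not describe. The `Element` class, its field names, and the `to_json` and `to_markdown` helpers are hypothetical names chosen for illustration.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Element:
    """One extracted piece of a page (hypothetical schema, for illustration only)."""
    kind: str        # "text", "table", or "image"
    page: int        # 1-based page number
    content: object  # str for text, list of rows for a table, file path for an image


def table_to_markdown(rows):
    """Render a list of rows (first row = header) as a pipe-delimited Markdown table."""
    header, *body = rows
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


def to_markdown(elements):
    """Convert extracted elements to Markdown, a format many LLM pipelines accept."""
    parts = []
    for e in elements:
        if e.kind == "text":
            parts.append(e.content)
        elif e.kind == "table":
            parts.append(table_to_markdown(e.content))
        elif e.kind == "image":
            parts.append(f"![page {e.page} image]({e.content})")
    return "\n\n".join(parts)


def to_json(elements):
    """Serialize the same elements as JSON for structured downstream processing."""
    return json.dumps([asdict(e) for e in elements], ensure_ascii=False, indent=2)


# Made-up sample data standing in for the result of parsing a PDF.
elements = [
    Element("text", 1, "Quarterly report"),
    Element("table", 1, [["Region", "Sales"], ["Seoul", "120"]]),
    Element("image", 2, "fig1.png"),
]
print(to_markdown(elements))
```

The point of the sketch is only that text, tables, and images end up as separate, typed records that can be rendered as Markdown or JSON; the real tool's formats may differ.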

Version 2.0 employs a hybrid engine combining AI-based and direct extraction methods, running entirely in local environments without transmitting data to external servers. It comes with four AI add-ons (optical character recognition, table extraction, formula extraction, and chart analysis) and is compatible with third-party open-source AI models such as Docling. In the company's in-house benchmark tests, the tool achieved the highest accuracy across all categories, including reading order, table extraction, and heading extraction, compared with peer open-source alternatives. Hancom published the test data and reproducible code on its official GitHub repository to ensure transparency.
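
The article does not say how the hybrid engine decides between its two methods. One common way to combine them, shown here as a sketch under that assumption, is to read a page's embedded text layer directly when it exists and fall back to a locally running AI model only for pages without one, such as scans. The `Page` class, `extract_hybrid` function, `min_chars` threshold, and the stub backend are all hypothetical and are not taken from Hancom's code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Page:
    """Hypothetical stand-in for a parsed PDF page."""
    number: int
    text_layer: str  # text embedded in the PDF; empty for scanned pages


def direct_extract(page: Page) -> str:
    """Direct extraction: read the text already embedded in the page."""
    return page.text_layer.strip()


def extract_hybrid(
    pages: List[Page],
    ai_backend: Callable[[Page], str],
    min_chars: int = 20,  # assumed threshold, not a documented setting
) -> Dict[int, Tuple[str, str]]:
    """Route each page to direct extraction or to a local AI backend.

    Returns {page number: (method used, extracted text)}.
    """
    results = {}
    for page in pages:
        text = direct_extract(page)
        if len(text) >= min_chars:
            # Enough embedded text: the cheap, deterministic path suffices.
            results[page.number] = ("direct", text)
        else:
            # Little or no text layer: hand the page to the AI model.
            # The callable runs in-process, so no data leaves the machine.
            results[page.number] = ("ai", ai_backend(page))
    return results


def stub_ocr(page: Page) -> str:
    """Placeholder for a local OCR model; a real one would read page pixels."""
    return f"ocr text for page {page.number}"


pages = [Page(1, "Embedded text that is long enough."), Page(2, "")]
for number, (method, text) in extract_hybrid(pages, stub_ocr).items():
    print(number, method, text)
```

Passing the AI backend in as a plain callable keeps the routing logic independent of any particular model, which is one way a tool could stay compatible with third-party models.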

Open Data Loader PDF was registered last year as an official component of LangChain, a global AI development framework. In 2026, the company plans to expand integration with major AI frameworks including LangFlow, LlamaIndex, and Gemini CLI, while also preparing Model Context Protocol (MCP) support for AI agents. Version 2.0 also adopted the Apache 2.0 license, which permits free commercial use, lowering the adoption barrier for enterprises and developers.

"This achievement is the result of Hancom's document data extraction technology being directly validated by the global developer community in terms of completeness and practicality, and it also confirmed the potential for expanding the technology ecosystem through diverse applications," said Hancom CEO Kim Yeon-su. "Through the transition to the Apache 2.0 license, we will develop this into an open PDF data platform that companies and developers worldwide can freely use and extend."

AI-translated from Korean. Quotes from foreign sources are based on Korean-language reports and may not reflect exact original wording.