AI Voice Actor Trained on 1.2 Million Hours Conquers YouTube and Instagram

By Kim Tae-ho

"So this time, Trump is saying~"

A distinctive voice has dominated YouTube Shorts and Instagram Reels for several years. It belongs to "Changsu," a young male character who speaks hurriedly, as if perpetually busy, with subtle traces of a Chungcheong Province dialect. The voice is used mainly in short-form economic and current affairs content, where countless social media channels employ it much like a podcast announcer. Changsu sometimes raises his voice to convey excitement and at other times adjusts his speaking pace to create a serious atmosphere.

"AI voice actor trained for 1.2 million hours... has taken over YouTube and Instagram" [Scale-Up Report] - Seoul Economic Daily Technology News from South Korea

The original owner of Changsu's voice is veteran voice actor Hyun Kyung-su, who debuted in 2003. Although countless social media accounts use the voice, Hyun does not record new material for them. Instead, an AI model trained on his earlier recordings reproduces his voice. Users enter a script and specify the desired emotional tone, and Changsu's lifelike voice is generated instantly. This is how the AI voice service "Typecast" operates.

Kim Tae-su, CEO of Neosapience, the company behind Typecast, told The Seoul Economic Daily on the 25th, "Besides Changsu, we have many popular characters including Valkyrie, an always-angry female character, and Changu, frequently used for donation messages in internet livestreams." He added, "Typecast offers 682 such characters."

682 AI Voice Characters Created with LLM

Typecast is an AI text-to-speech (TTS) service launched by startup Neosapience in November 2019. TTS refers to software that converts written text into synthetic speech. The technology was commercially established by the 1990s and was widely used in the 2000s and 2010s for screen readers and public transportation announcements. However, early TTS systems stitched together pre-recorded sounds and syllables, which produced unnatural results.
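The "stitching" approach of early TTS can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the syllable names, the `UNIT_BANK` inventory, and the sine tones standing in for recorded sound units are hypothetical and are not drawn from any real TTS product.

```python
import math

SAMPLE_RATE = 8000  # samples per second (illustrative value)


def make_unit(freq_hz, duration_ms=100):
    """Stand-in for one pre-recorded sound unit: a short sine tone."""
    n = SAMPLE_RATE * duration_ms // 1000
    return [math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE) for i in range(n)]


# Hypothetical unit inventory: each syllable maps to exactly one fixed recording.
UNIT_BANK = {
    "an": make_unit(220.0),
    "nyeong": make_unit(330.0),
    "ha": make_unit(275.0),
}


def synthesize(syllables):
    """Concatenate stored units back to back, ignoring the surrounding context."""
    samples = []
    for syllable in syllables:
        samples.extend(UNIT_BANK[syllable])
    return samples


greeting = synthesize(["an", "nyeong", "ha"])
reordered = synthesize(["ha", "an"])

# The same syllable is replayed identically wherever it appears, so pitch,
# pace, and emotion cannot adapt to the sentence around it.
print(greeting[:800] == reordered[800:1600])
```

Because each unit is played back unchanged regardless of its neighbors, a system built this way has no mechanism for shifting tone or pace across a sentence, which is one plausible reading of why the article calls the results unnatural.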

In contrast, Typecast produces natural speech that sounds like an actual human voice. Users can easily adjust speaking speed, pitch, and emotional expression. After launch, Typecast became popular among YouTubers who prefer not to use their own voices.

The service drew particular attention when it was revealed that both the young male voice and the child voice in "1-Minute Cooking Ddukddak Brother," a cooking YouTube channel with 3 million subscribers, were created using Typecast.

"Six years after launch, cumulative registered users exceed 2.8 million," Kim said. "About 60% appear to be individual content creators, with the remaining 40% being corporate or institutional clients."

When asked how Typecast manages so many voices while keeping them natural, Kim pointed to massive training data and continuous AI model updates. "The voice training data fed into our proprietary model exceeds 1.2 million hours," he noted. The natural-sounding delivery comes from extensively training the AI on text that is accurately paired with the corresponding speech.
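For a sense of scale, the figures Kim cites can be worked through in a few lines of Python. This is only back-of-the-envelope arithmetic on the article's numbers. Both inputs are approximate ("exceeds" 2.8 million users, "about 60%"), so the outputs are rough estimates rather than reported figures.

```python
# Figures quoted in the article (approximate lower bounds).
REGISTERED_USERS = 2_800_000
INDIVIDUAL_SHARE_PERCENT = 60
TRAINING_HOURS = 1_200_000

# Split of registered users, using integer arithmetic to avoid rounding noise.
individual_users = REGISTERED_USERS * INDIVIDUAL_SHARE_PERCENT // 100
corporate_users = REGISTERED_USERS - individual_users

# How long the training audio would take to play back without pause.
playback_days = TRAINING_HOURS / 24
playback_years = playback_days / 365

print(f"Individual creators: about {individual_users:,}")
print(f"Corporate or institutional clients: about {corporate_users:,}")
print(f"Continuous playback: {playback_days:,.0f} days, roughly {playback_years:.0f} years")
```

Played end to end, 1.2 million hours of audio works out to 50,000 days, or well over a century of uninterrupted listening.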

"While exact comparisons with other companies are difficult, we would certainly rank within the top five globally," Kim said confidently.

Another differentiator Kim emphasized is the AI model itself. Neosapience operates Typecast using "SSFM," its proprietary large language model. LLMs are considered well suited to interpreting and learning complex natural language structures, and they play a crucial role in determining the quality of generative AI services.

"We've trained the capability to generate any voice, emotion, or language," Kim said. "It's like a comedian who can instantly mimic any voice." He added, "As Typecast users build up data by creating natural speech, SSFM evolves through reinforcement learning into an even more natural-sounding model."

Founded on the Bet That the "AI Voice Market Will Bloom"

With annual revenue of 10.6 billion won, Neosapience is now growing steadily, but its early days were challenging. Kim, who worked as a voice technology engineer at LG Electronics and Qualcomm, began watching AI closely around 2016, when Go champion Lee Se-dol faced AlphaGo. AI-related papers were flooding both academia and industry at the time.

Kim concluded that "if a world comes where AI sees, hears, and speaks, my expertise would be most valuable in speaking," and launched his startup.

The following year, Kim left Qualcomm to establish Neosapience. Typecast wasn't developed immediately. Though the company possessed deep learning-based TTS technology, it failed to attract market attention. While natural emotional expression was a strength compared to existing TTS, the computational demands made service operation slow.

After two years of various attempts, the company made its final push in 2019: creating a TTS service for social media content creators. The business idea stemmed from demand among bloggers and Facebook users who wanted to move to YouTube but couldn't find suitable voices.

A free beta version launched in April 2019 with fewer than 20 characters. Because the beta was free, intellectual property issues with the voice actors remained unresolved, which ruled out commercial use. Nevertheless, customer response was enthusiastic.

"Users of the beta service flooded us with requests saying, 'Let us pay so we can use it commercially,'" Kim recalled.

When asked why the company continued investing in AI TTS development despite early struggles, Kim replied, "It was a worthwhile bet." He explained, "Throughout technology history, new technologies never immediately replace existing ones. As development speeds increase and investment costs decrease over time, I was confident AI TTS could disrupt the existing market."

Neosapience's next step is capturing the AI voice market beyond video content. Kim forecasts that as physical AI rises, voice will become the most essential tool for human-AI communication.

"When physical AI becomes mainstream, voice will be a design element determining physical AI's first impression," Kim said. "Neosapience's business will also expand from AI TTS to conversational AI solutions."


AI-translated from Korean. Quotes from foreign sources are based on Korean-language reports and may not reflect exact original wording.