As deepfakes multiply, OpenAI is improving the technology used to clone voices — but the company claims it is doing it ethically.
Today marks the preview release of OpenAI’s Voice Engine, which expands the company’s existing text-to-speech API. Voice Engine, which has been in development for about two years, lets users upload any 15-second speech sample and generate a synthetic copy of that voice. However, there is no set date for public availability, which gives the company time to respond to how the model is used and abused.
“We want to make sure that everyone feels good about how it’s being deployed — that we understand the landscape of where this technology is dangerous and have mitigations in place for that,” Jeff Harris, a member of OpenAI’s product crew, told Shuttech in an interview.
TRAINING THE OPENAI MODEL
According to Harris, the generative AI model that powers Voice Engine has been hidden in plain sight for quite some time.
The same model powers the voice and “read aloud” features of ChatGPT, OpenAI’s AI-powered chatbot, as well as the preset voices available in OpenAI’s text-to-speech API. Spotify has been using it since early September to dub podcasts from high-profile hosts such as Lex Fridman into multiple languages.
I asked Harris where the model’s training data came from, which is a delicate subject. He would only confirm that the Voice Engine model was trained on a mix of licensed and publicly available data.
Models like the one that powers Voice Engine are trained on an enormous number of examples (in this case, speech recordings), which are often sourced from public websites and data sets on the internet. Many generative AI companies view training data as a competitive advantage, so they keep it and related information close to the chest. However, training data details are also a potential source of intellectual property disputes, which is another reason not to reveal much.
OpenAI is already being sued over allegations that it violated intellectual property law by training its AI on copyrighted content, including images, artwork, code, articles, and e-books, without crediting or compensating the authors or owners.
SYNTHESIZING VOICE
Perhaps surprisingly, Voice Engine is neither trained nor fine-tuned on user data. That is due in part to the ephemeral way the model, which combines a diffusion process with a transformer, generates speech.
“We take a small audio sample and text and generate realistic speech that matches the original speaker,” Harris explained. “The audio that’s used is dropped after the request is complete.”
VOICE TALENT AS COMMODITY
Voice actor pay listed on ZipRecruiter ranges from $12 to $79 per hour, which is far more than Voice Engine costs, even at the low end. If it gains traction, OpenAI’s technology could commoditize voice work. So where does that leave the actors?