With GPT-4o, you can now engage with ChatGPT via "any combination of text, audio, and image," according to OpenAI.
What really sets GPT-4o apart is that it was trained end-to-end on text, vision and audio together. (The "o" in its name stands for "omni.")
This unified multimodal training allows GPT-4o to grasp nuance and context that gets lost when using separate models for each input/output type, as in ChatGPT's Voice Mode.
The model also matches the text and code capabilities of GPT-4 while greatly improving performance on non-English languages. And it blows past existing models in its ability to understand images and audio.
During its announcement event, OpenAI showed off what GPT-4o can do.
In one instance, OpenAI engineers held a live back-and-forth conversation with GPT-4o with very little delay.
(OpenAI says the model responds to audio inputs in as little as 232 milliseconds—about the same response time as a human in a conversation.)
During audio chats, the model was also able to display a range of tones and react naturally to being interrupted, picking back up where it left off—just like a human in a conversation would.
In another demo, the engineers streamed video to the model in real-time.
One of the engineers streamed himself writing out a math problem, then asked GPT-4o to offer advice on how to solve it while he wrote.
The model also proved impressively multilingual, conversing with CTO Mira Murati in fluent Italian at one point during the demonstration.
What do you need to know about this stunning new model?
I got the answer from Marketing AI Institute founder and CEO Paul Roetzer on Episode 98 of The Artificial Intelligence Show.
OpenAI says text and image capabilities have started rolling out. But the GPT-4o voice mode is still in alpha. It will roll out to ChatGPT Plus users "in the coming weeks."
Once live, GPT-4o's voice mode will be radically different from what exists today.
Currently, voice works through a pipeline of separate models. One transcribes your speech to text, one processes that text, then another converts the text back to audio.
This multi-step process means the AI loses a lot of information along the way.
GPT-4o solves this by combining text, vision and audio into one model. All inputs and outputs are handled by a single neural network.
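The information loss in the old pipeline can be illustrated with a toy sketch. This is not OpenAI's actual implementation; the function names and the audio representation below are hypothetical, chosen only to show how tone gets dropped when speech passes through a text-only middle stage.

```python
# Toy illustration (hypothetical functions): why a multi-model voice
# pipeline loses information that a single end-to-end model would keep.

def transcribe(audio: dict) -> str:
    # Speech-to-text keeps only the words; tone and other vocal
    # signals carried in the audio are discarded at this step.
    return audio["words"]

def generate_reply(text: str) -> str:
    # The text model never sees how the words were actually spoken.
    return f"Reply to: {text}"

def synthesize(text: str) -> dict:
    # Text-to-speech can't recover the speaker's original tone,
    # so it falls back to a default delivery.
    return {"words": text, "tone": "neutral"}

audio_in = {"words": "I'm fine", "tone": "sarcastic"}

# Pipeline approach: three separate models, information lost at each hop.
reply = synthesize(generate_reply(transcribe(audio_in)))
print(reply["tone"])  # the sarcasm never reached the text model
```

In the pipeline, the sarcastic tone is stripped out by transcription and can never influence the reply. A single multimodal model that ingests the raw audio directly avoids this bottleneck, which is what the unified training behind GPT-4o is meant to enable.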
"These things are now being trained on all these modalities from the ground up, which rapidly expands what they're going to be capable of doing," says Roetzer.
This seems to be a move towards a single, generally useful AI assistant that you can interact with seamlessly in your everyday life.
What's just as exciting is that GPT-4o is now available to all ChatGPT users, not just paid users. (Though ChatGPT Plus users will get higher usage limits for the new model.)
This matters.
Far too many people are still using only the free version of ChatGPT, which, until this announcement, offered a vastly inferior model and limited capabilities.
"Anytime I'm on stage, I'll ask the audience who's used ChatGPT, every hand goes up," says Roetzer. "But who has the paid version? Most of the hands go down."
While powerful free AI is great, Roetzer cautions that access alone isn't enough. People need AI literacy to truly unlock its potential.
"Just giving people more powerful tools does not mean they're going to know what to do with them," he says.
"It's very rare you find people who are advanced users of these tools, who have really built their own prompt libraries, have pushed the limits of what they're capable of."
Still, putting GPT-4o in millions of hands for free is bound to accelerate adoption and use cases—especially among businesses and professionals.