Slightly more than 10 months ago OpenAI’s ChatGPT was first released to the public. Its arrival ushered in an era of nonstop headlines about artificial intelligence and accelerated the development of competing large language models (LLMs) from Google, Meta and other tech giants. Since that time, these chatbots have demonstrated an impressive capacity for generating text and code, albeit not always accurately. And now multimodal AIs that are capable of parsing not only text but also images, audio, and more are on the rise.
OpenAI released a multimodal version of ChatGPT, powered by its LLM GPT-4, to paying subscribers for the first time last week, months after the company first announced these capabilities. Google began incorporating image and audio features similar to those offered by the new GPT-4 into some versions of its LLM-powered chatbot, Bard, back in May. Meta, too, announced big strides in multimodality this past spring. Though it is in its infancy, the burgeoning technology can perform a variety of tasks.
What Can Multimodal AI Do?
We tested two different chatbots that rely on multimodal LLMs: a version of ChatGPT powered by the updated GPT-4 (dubbed GPT-4 with vision, or GPT-4V) and Bard, which is currently powered by Google’s PaLM 2 model. Both can hold hands-free vocal conversations using only audio, and they can describe scenes within images and decipher lines of text in a picture.
These abilities have myriad applications. In our test, using only a photograph of a receipt and a two-line prompt, ChatGPT accurately split a complicated bar tab and calculated the amount owed for each of four different people—including tip and tax. Altogether, the task took less than 30 seconds. Bard did nearly as well, but it interpreted one “9” as a “0,” thus flubbing the final total. In another trial, when given a photograph of a stocked bookshelf, both chatbots offered detailed descriptions of the hypothetical owner’s supposed character and interests that were almost like AI-generated horoscopes. Both identified the Statue of Liberty from a single photograph, deduced that the image was snapped from an office in lower Manhattan and offered spot-on directions from the photographer’s original location to the landmark (though ChatGPT’s guidance was more detailed than Bard’s). And ChatGPT also outperformed Bard in accurately identifying insects from photographs.
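For readers who want to check a chatbot's arithmetic on a task like the bar tab, the calculation can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `split_tab`, the names, the prices and the rates are all hypothetical, and it assumes that tax and tip are both charged as a percentage of each person's pre-tax subtotal, which may not match how a given receipt works.

```python
from decimal import Decimal, ROUND_HALF_UP

def split_tab(items, tax_rate, tip_rate):
    """Return the amount each person owes, rounded to the cent.

    items: dict mapping a person's name to a list of Decimal prices.
    tax_rate, tip_rate: Decimal fractions (assumption: both are applied
    to the pre-tax subtotal).
    """
    owed = {}
    for person, prices in items.items():
        subtotal = sum(prices, Decimal("0"))
        total = subtotal * (1 + tax_rate + tip_rate)
        owed[person] = total.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
    return owed

# Hypothetical tab: two people, 10 percent tax, 20 percent tip.
example = split_tab(
    {"Ana": [Decimal("10.00")], "Ben": [Decimal("20.00"), Decimal("5.00")]},
    tax_rate=Decimal("0.10"),
    tip_rate=Decimal("0.20"),
)
```

Using `Decimal` rather than floating-point numbers avoids small rounding surprises when working with money.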
For disabled communities, the applications of such tech are particularly exciting. In March OpenAI started testing its multimodal version of GPT-4 through the company Be My Eyes, which provides a free description service through an app of the same name for blind and low-sighted people. The early trials went well enough that Be My Eyes is now in the process of rolling out the AI-powered version of its app to all its users. “We are getting such exceptional feedback,” says Jesper Hvirring Henriksen, chief technology officer of Be My Eyes. At first there were lots of obvious issues, such as poorly transcribed text or inaccurate descriptions containing AI hallucinations. Henriksen says that OpenAI has since improved on those initial shortcomings: errors are still present but less common. As a result, “people are talking about regaining their independence,” he says.
How Does Multimodal AI Work?
In this new wave of chatbots, the tools go beyond words. Yet they’re still based around artificial intelligence models that were built on language. How is that possible? Although individual companies are reluctant to share the exact underpinnings of their models, these corporations aren’t the only groups working on multimodal artificial intelligence. Other AI researchers have a pretty good sense of what’s happening behind the scenes.
There are two primary ways to get from a text-only LLM to an AI that also responds to visual and audio prompts, says Douwe Kiela, an adjunct professor at Stanford University, where he teaches courses on machine learning, and CEO of the company Contextual AI. In the more basic method, Kiela explains, AI models are essentially stacked on top of one another. A user inputs an image into a chatbot, but the picture is filtered through a separate AI that was built explicitly to spit out detailed image captions. (Google has had algorithms like this for years.) Then that text description is fed back to the chatbot, which responds to the translated prompt.
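The stacked approach Kiela describes can be sketched roughly as follows. This is a toy illustration, not any company's actual system: `make_stacked_chatbot`, `toy_captioner` and `toy_llm` are hypothetical stand-ins for a real image-captioning model and a real text-only LLM.

```python
from typing import Callable

def make_stacked_chatbot(
    captioner: Callable[[bytes], str],
    llm: Callable[[str], str],
) -> Callable[[bytes, str], str]:
    """Chain a captioning model in front of a text-only language model."""
    def answer(image: bytes, question: str) -> str:
        # Step 1: a separate model turns the picture into text.
        caption = captioner(image)
        # Step 2: the text-only chatbot sees only the caption and the question.
        prompt = f"Image description: {caption}\nUser question: {question}"
        return llm(prompt)
    return answer

# Hypothetical stand-ins so that the sketch runs on its own.
def toy_captioner(image: bytes) -> str:
    return f"an image of {len(image)} bytes"

def toy_llm(prompt: str) -> str:
    return "Response based on: " + prompt

bot = make_stacked_chatbot(toy_captioner, toy_llm)
reply = bot(b"abc", "What is in this picture?")
```

Note that the chatbot never sees the pixels in this design. Anything the captioner leaves out or gets wrong is lost before the language model starts writing.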
In contrast, “the other way is to have a much tighter coupling,” Kiela says. Computer engineers can insert segments of one AI algorithm into another by combining the computer code infrastructure that underlies each model. According to Kiela, it’s “sort of like grafting one part of a tree onto another trunk.” From there, the grafted model is retrained on a multimedia data set—including pictures, images with captions and text descriptions alone—until the AI has absorbed enough patterns to accurately link visual representations and words together. It’s more resource-intensive than the first strategy, but it can yield an even more capable AI. Kiela theorizes that Google used the first method with Bard, while OpenAI may have relied on the second to create GPT-4. This idea potentially accounts for the differences in functionality between the two models.
Regardless of how developers fuse their different AI models together, under the hood, the same general process is occurring. LLMs function on the basic principle of predicting the next word or syllable in a phrase. To do that, they rely on a “transformer” architecture (the “T” in GPT). This type of neural network takes something such as a written sentence and turns it into a series of mathematical relationships that are expressed as vectors, says Ruslan Salakhutdinov, a computer scientist at Carnegie Mellon University. To a transformer neural net, a sentence isn’t just a string of words—it’s a web of connections that map out context. This gives rise to much more humanlike bots that can grapple with multiple meanings, follow grammatical rules and imitate style. To combine or stack AI models, the algorithms have to transform different inputs (be they visual, audio or text) into the same type of vector data on the path to an output. In a way, it’s taking two sets of code and “teaching them to talk to each other,” Salakhutdinov says. In turn, human users can talk to these bots in new ways.
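The idea of turning different inputs into the same type of vector can be illustrated with a deliberately tiny sketch. Everything here is an assumption made for illustration: the feature sizes, the hand-picked projection matrices and the names `project`, `IMAGE_TO_SHARED` and `TEXT_TO_SHARED` are hypothetical, and real models learn far larger projections from data.

```python
def project(vector, matrix):
    """Map a feature vector into the shared space (matrix is a list of rows)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

# Hypothetical projections: image features have 3 numbers, text features
# have 4, and both are mapped into the same 2-number shared space.
IMAGE_TO_SHARED = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 1.0],
]
TEXT_TO_SHARED = [
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5],
]

image_patches = [[0.2, 0.4, 0.1], [0.9, 0.0, 0.3]]  # made-up image features
text_tokens = [[1.0, 1.0, 0.0, 0.0]]                # made-up word features

# Once projected, pictures and words form a single sequence of same-sized
# vectors that one transformer could process together.
sequence = (
    [project(p, IMAGE_TO_SHARED) for p in image_patches]
    + [project(t, TEXT_TO_SHARED) for t in text_tokens]
)
```

After projection, the downstream model no longer needs to know which vectors began as pixels and which began as words.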
What Comes Next?
Many researchers view the present moment as the start of what’s possible. Once you begin aligning, integrating and improving different types of AI together, rapid advances are bound to keep coming. Kiela envisions a near future where machine learning models can easily respond to, analyze and generate videos or even smells. Salakhutdinov suspects that “in the next five to 10 years, you’re just going to have your personal AI assistant.” Such a program would be able to navigate everything from full customer service phone calls to complex research tasks after receiving just a short prompt.
Multimodal AI is not the same as artificial general intelligence, a holy grail of machine learning in which computer models surpass human intellect and capacity. Multimodal AI is an “important step” toward it, however, says James Zou, a computer scientist at Stanford University. Humans have an interwoven array of senses through which we understand the world. Presumably, to reach general AI, a computer would need the same.
As impressive and exciting as they are, multimodal models have many of the same problems as their singly focused predecessors, Zou says. “The one big challenge is the problem of hallucination,” he notes. How can we trust an AI assistant if it might falsify information at any moment? Then there’s the question of privacy. With information-dense inputs such as voice and visuals, even more sensitive information might inadvertently be fed to bots and then regurgitated in leaks or compromised in hacks.
Zou still advises people to try out these tools—carefully. “It’s probably not a good idea to put your medical records directly into the chatbot,” he says.