Mike Elgan
Contributing Columnist

AI glasses + multimodal AI = a massive new industry

opinion
May 20, 2024 | 7 mins
Computers and Peripherals | Emerging Technology | Generative AI

New tech demos last week by OpenAI and Google show why smart glasses are the perfect platform for AI chatbots.

Credit: iStock

OpenAI last week demonstrated its GPT-4o multimodal AI model, and Google followed a day later with a demonstration of its Project Astra (a set of features coming later to Google’s Gemini). Both initiatives use video input (along with audio) to prompt sophisticated, powerful and natural AI chatbot responses. 

Both demos were impressive and ground-breaking, and performed similar feats. 

OpenAI is either further ahead of Google or less timid (probably both): the company promised public availability of what it demonstrated within weeks, whereas Google promised something “later this year.” More to the point, OpenAI claims that its new model is twice as fast as, and half the cost of, GPT-4 Turbo. (Google didn’t feel confident enough to brag about the performance or cost of the Astra features.)

Before these demos, the public knew the word “multimodal” mostly from Meta, which heavily promoted multimodal features for its Ray-Ban Meta glasses to the public in the past couple of months. 

The experience of using the Ray-Ban Meta glasses’ multimodal feature goes something like this: You say, “Hey, Meta, look and tell me what you see.” You hear a click, indicating that a picture is being taken, and, after a few seconds, the answer is spoken to you: “It’s a building,” or some other general description of the objects in the frame.

Ray-Ban Metas use the integrated camera to capture a still image, not video, and the result is somewhat unimpressive, especially in light of the multimodal demos by OpenAI and Google.

The powerful role of video in multimodal AI

Multimodal AI simultaneously combines text, audio, photos and video. (And to be clear, it can get the “text” information directly from the audio, photos or video. It can “read” or extract the words it sees, then input that text into the mix.) 
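
To make “multimodal” concrete: from a developer’s point of view, a single prompt can already mix text and images. Here’s a minimal sketch using OpenAI’s Python SDK with a GPT-4o-class model; the image URL and the question are placeholders, and the snippet only illustrates the idea of mixed-modality input, not either company’s production pipeline.

```python
# Minimal sketch of a mixed-modality prompt (text + image), assuming the
# OpenAI Python SDK (v1+) and an OPENAI_API_KEY in the environment.
# The image URL and question are placeholders for illustration only.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",  # a GPT-4o-class multimodal model
    messages=[
        {
            "role": "user",
            "content": [
                # The model "reads" any text it sees in the image and folds it
                # into the same conversation as the typed question.
                {"type": "text",
                 "text": "What am I looking at, and what text appears in it?"},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/scene.jpg"}},
            ],
        }
    ],
)

print(response.choices[0].message.content)
```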

Multimodal AI with video brings the user-computer interface vastly closer to the human experience. While AI can’t think or understand, its ability to harness video and other inputs puts it and the person (who is also multimodal) on the same page about their physical surroundings or whatever they’re both attending to.

For example, during the Google I/O keynote, engineers back at Google DeepMind headquarters were watching it together with Project Astra, which (like OpenAI’s new model) can see, read and “watch” what’s on your computer screen. They posted this video on X, showing an engineer chit-chatting with the AI about the keynote playing on screen.

Another fun demo that emerged showed GPT-4o in action. In that video, an OpenAI engineer uses a smartphone running GPT-4o, pointing its camera at the world to describe what it sees in response to the comments and questions of a second GPT-4o instance running on another smartphone.

In both demos, the phones are doing what another person would be able to do — walk around with a person and answer their questions about objects, people and information in the physical world. 
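
Neither company has said exactly how its demo streams video into the model, but one common way to approximate that “walk around and answer questions” experience today is to sample frames from a camera and attach them as images. The rough sketch below assumes the OpenAI Python SDK, the opencv-python package, a “gpt-4o” model and a one-frame-per-second sampling rate; all of those are illustrative assumptions, not a description of either demo’s internals.

```python
# Rough sketch: approximate "video" input by sampling webcam frames and
# sending them as images. Assumes opencv-python and the OpenAI SDK; the
# model name and one-frame-per-second rate are illustrative assumptions.
import base64

import cv2
from openai import OpenAI

client = OpenAI()
capture = cv2.VideoCapture(0)            # default webcam
fps = int(capture.get(cv2.CAP_PROP_FPS) or 30)

frames = []
for i in range(fps * 5):                 # watch for roughly 5 seconds
    ok, frame = capture.read()
    if not ok:
        break
    if i % fps == 0:                     # keep about one frame per second
        ok, jpeg = cv2.imencode(".jpg", frame)
        if ok:
            frames.append(base64.b64encode(jpeg.tobytes()).decode())
capture.release()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [{"type": "text",
                     "text": "Describe what is happening across these frames."}]
                   + [{"type": "image_url",
                       "image_url": {"url": f"data:image/jpeg;base64,{f}"}}
                      for f in frames],
    }],
)
print(response.choices[0].message.content)
```

Glasses would simply move that camera from a hand-held phone to the wearer’s face.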

Advertisers are looking to video in multimodal AI as a way to register the emotional impact of their ads. “Emotions emerge through technology like Project Astra, which can process the real world through the lens of a mobile phone camera. It continually processes images and information that it sees and can return answers, even after it has moved past the object,” according to an opinion piece on MediaPost by Laurie Sullivan.

The power of this technology for multiple industries will prove inestimable. 

Why all trends in multimodal AI point to AI glasses

Both the OpenAI and Google demos clearly point to a future where, thanks to the video mode in multimodal AI, we’ll be able to show AI something, or a room full of somethings, and engage with a chatbot to help us know, process, remember or understand.

It would all be very natural, except for one awkward element: All this holding and waving around of phones to show the AI what we want it to “see” is completely unnatural. Obviously (obviously!) video-enabled multimodal AI is headed for face computers, a.k.a. AI glasses.

And, in fact, one of the most intriguing elements of the Google demo came when the presenter asked Astra-enhanced Gemini if it remembered where her glasses were, and it directed her back to a table, where she picked up the glasses and put them on. At that point, the glasses, which were prototype AI glasses, seamlessly took over the chat session from the phone (the whole thing was surely still running on the phone, with the glasses providing the camera, microphones and so on).

From the moment she put on those glasses, the interaction became totally natural. Instead of awkwardly holding up a smartphone and pointing its camera at things, she merely looked at them. (At one point, she even petted and cuddled her dog with both hands while still using the chatbot.)

And in this Google DeepMind Astra video (posted after last week’s event), AI is interacting with content on a phone screen (rather than using the phone to point at non-phone objects). 

Judging from the video, it’s unlikely that commercialization of an actual consumer and business product (let’s call it “Pixel Glasses”) is imminent. Two years ago, Google I/O featured a research prototype of translation glasses, which looked like a promising idea until Google killed the project last year.

What nobody’s talking about now is that, in hindsight, those translation glasses were almost certainly based on video-enhanced multimodal AI. In the demo, they translated spoken Mandarin, Spanish and English, with subtitles displayed to the wearer in English, but they also translated American Sign Language into English subtitles. At the time, people shrugged at that segment of the video, but now it’s clear: Multimodal AI was reading the sign language and translating it in real time (or faking that).

I think we need to update the narrative here: The Google translation glasses weren’t cancelled. That demo was really just an early prototype of the Astra features Google didn’t want to announce two years ago.

And, in fact, the prototype glasses in the Astra video look the same as the glasses in the translation-glasses video; they’re probably using the same prototype hardware.

Meanwhile, we were reminded that Google continues to work on AI glasses hardware when, on May 9, the Patent Office granted Google a patent based on technology from North, a company it acquired four years ago. The patent describes systems and methods for laser projectors with optical engines capable of measuring light intensity and laser output power, projectors designed to be integrated into AI glasses.

While companies like Google can design and manufacture their own AI glasses, any other AI company could partner with either Luxottica, as Meta has, or with a startup like Avegant (which I wrote about in March, and which works with partners Qualcomm and Applied Materials) to supply the hardware for a product sold under the AI company’s brand. So, we can look forward to OpenAI Glasses, Perplexity Glasses, Pi Glasses, Bing Glasses, Claude Glasses and (my favorite possibility) Hugging Face Glasses.

I’ve been predicting that a massive AI glasses industry is about to take off and take over. It will probably happen next year. And the new trend in multimodal AI with video as one of the modes should convince everyone how big the AI glasses market will be.