Why is Apple so focused on vision AI?

News analysis
Mar 19, 2024 | 5 mins
Apple | Augmented Reality | Generative AI

Because vision intelligence can understand what it sees, contextualize that information, make decisions based on it, and alter the appearance of what is there.

eXeX and Cromwell Hospital pioneer the first use of Apple Vision Pro in UK surgery.
Credit: eXeX

From its DarwinAI acquisition to recent reports claiming Apple might work with Google and others to support a wider array of generative AI (genAI) tools than it plans to introduce itself, it’s pretty clear the company has chosen where to focus its own AI development.

At least one of these focus areas reflects work the company has been doing since before AI became a buzzword — and that’s vision intelligence.

Intimations of life

By this, I specifically mean AI that can understand what it sees, contextualize that information, make decisions based on it, alter the view, and so on.

You might already be making use of this kind of AI:

  • Each time you photograph a document and your iPhone lets you copy the text out to paste into another document (sketched in code after this list).
  • When your iPhone can tell you where the doors of a building are.
  • When you tap the ‘i’ button in Photos to see descriptions of what is visible in an image.
  • When your iPhone explains the meaning of a laundry care label you point its camera at.
  • When you use Translate to decipher text on signs around you.
  • When the LiDAR sensor provides you with a room map.
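
To make the first of those concrete: the text recognition behind features like Live Text is exposed to developers through Apple’s Vision framework. Here’s a minimal sketch, assuming cgImage holds a photographed document; real code would handle errors and recognition languages properly:

```swift
import CoreGraphics
import Vision

// Minimal sketch: recognize printed text in an image using Apple's
// Vision framework, the same capability behind Live Text.
func recognizeText(in cgImage: CGImage) {
    let request = VNRecognizeTextRequest { request, _ in
        // Each observation is one detected region of text.
        guard let observations = request.results as? [VNRecognizedTextObservation] else { return }
        // Keep the highest-confidence candidate string from each region.
        let lines = observations.compactMap { $0.topCandidates(1).first?.string }
        print(lines.joined(separator: "\n"))
    }
    request.recognitionLevel = .accurate // slower, but better suited to documents

    let handler = VNImageRequestHandler(cgImage: cgImage, options: [:])
    try? handler.perform([request]) // errors ignored in this sketch
}
```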

There are many other examples. There may even be better illustrations that demonstrate the direction of travel.

Electron blues

Apple’s researchers recently published a paper that has generated conversation and comment since its release. It describes a technology called MM1, a multimodal model that works with both text and image data.

That means it can train large language models (LLMs) using both text and images, and it is being called a “significant advance” for AI. Models built with the technique performed well on tasks such as image captioning, visual question answering, and natural language inference.

The system also showed strong in-context learning capabilities. In other words, it can pick up a new task from just a few examples of text and images supplied in the prompt, which suggests the tech could eventually handle complex, open-ended problems. That is a holy grail of AI research, because achieving it means machines capable of solving problems in a highly contextual way.
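
MM1 exposes no public API, so the following is purely an illustrative sketch of the shape of multimodal in-context learning: a prompt interleaving a few image-and-caption examples, then asking the model to complete a new one. Every name here is hypothetical.

```swift
// Hypothetical sketch only; MM1 has no public interface. This shows the
// *structure* of a multimodal few-shot prompt, not a real API.
enum PromptSegment {
    case image(String) // placeholder for real image data
    case text(String)
}

// Two worked examples followed by a fresh query. An in-context learner
// infers the captioning task from the examples alone, with no retraining.
let fewShotPrompt: [PromptSegment] = [
    .image("dog_photo"),  .text("Caption: A dog catching a frisbee."),
    .image("cafe_photo"), .text("Caption: People chatting outside a cafe."),
    .image("new_photo"),  .text("Caption:") // the model completes this one
]
```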

That’s all good, but what’s important here is the use of images. This is not the first time in recent months Apple has harnessed machine vision intelligence this way: its researchers have also shown off the Keyframer animation tool, and even earlier in 2023 we heard that part of what the company intended to build was AI capable of creating realistic immersive scenes for use in Vision Pro.

Automated for the people

And the latter product is, of course, the space in which Apple’s vision for generative visual AI may make the biggest difference, and the implications are profound. Think how it could let one person wearing a Vision Pro enter any environment and, while exploring that space, build a digital replica of the place that can be shared with others. The thing is, armed with vision intelligence this wouldn’t be a dumb representation: the shared experience wouldn’t just look like the place being explored; with a few parameter tweaks to correct any errors, it would effectively be a fully functioning digital replica of that space.
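
Apple already ships an early public building block for exactly this kind of capture: the RoomPlan framework, which uses the LiDAR sensor to build a parametric 3D model of a room as you scan it. A minimal sketch of driving a scan programmatically (iOS 16+, LiDAR-equipped device; UI and error handling omitted):

```swift
import RoomPlan

// Minimal sketch: run a RoomPlan capture session and watch the live
// room model refine itself as the user scans the space.
final class RoomScanner: NSObject, RoomCaptureSessionDelegate {
    private let session = RoomCaptureSession()

    func start() {
        session.delegate = self
        session.run(configuration: RoomCaptureSession.Configuration())
    }

    // Called repeatedly as the captured room model is updated mid-scan.
    func captureSession(_ session: RoomCaptureSession, didUpdate room: CapturedRoom) {
        print("Walls so far: \(room.walls.count), objects: \(room.objects.count)")
    }
}
```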

This is useful in all kinds of situations, from traffic management to building and facilities management, but the capacity to build true-to-life, intelligent representations of spaces also extends to architecture and design. And, of course, there are evident implications for health.

None of these ideas may work out quite the way I’m articulating them, but I am certain that Vision Pro’s place in building digital twins for multiple industries is set in stone.

Everybody hurts

But the combination of a new, deeply visual operating system (visionOS) with visual AI capable of deep contextual understanding and response isn’t just something that’s catching up with the famed Tom Cruise movie, Minority Report.

It is a tech deployment about to happen in real time, one that moves beyond the visions of the futurologists who advised on that movie.

No wonder the entire industry now wants to move in Apple’s direction — it’s got to hurt to see the company get there fastest. But everybody hurts, sometimes.

Please follow me on Mastodon, or join me in the AppleHolic’s bar & grill and Apple Discussions groups on MeWe.
