AI+Vision+Voice – How did we get here and where are we going?
Get ready for the AI-assistants: My view on the future of AI interaction and beyond
Today, we see a clear trajectory of AI growth, driven mainly by large language models (LLMs). The hype cycle focuses on these LLMs, but in reality, AI encompasses much more: supervised machine learning models such as linear regression and decision trees, deep neural networks, pattern recognition, fuzzy logic, and other techniques. The list goes on.
For me personally, LLMs are almost like the frontend of software engineering. They’re what everyone sees, but in the background, there’s a lot more going on.
But they also play a large role in the next generation of AI applications: AI assistants.
In this article, I will look at what has been missing for AI assistants to really take off, why LLMs matter here, and how I see the future of interaction evolving as a result.
Inputs and outputs, a 50-year-old paradigm
Since the beginning of time (or since Xerox PARC?), we have interacted with computers through a screen and keyboard/mouse. In the early 1970s, smart folks at Xerox PARC built the Xerox Alto, sometimes referred to as the mother of all computers. It had a graphical user interface (GUI) and a mouse and keyboard as input methods. While the machine itself was probably slightly ahead of its time and wasn’t a sales success, it set the tone for almost all further development of computers.
This means the user interface paradigm I am writing this article with is more than 50 years old. The world has changed since the 1970s, but this paradigm is still in play.
However, this is not completely true. We have seen the development of mobile phones, essentially acting as a second, portable screen. You can do most things on your phone, but most people agree that when real work needs to be done, you move to your computer, which is often a laptop these days.
Despite my comments on this 50+ year-old paradigm, a lot of research and development has been put into challenging and exploring new paradigms.
Microsoft HoloLens
The HoloLens was the first “next generation” hardware that truly impressed me. It’s a headset that overlays information on a HUD, much like the helmets fighter pilots use. Back in 2019, when the HoloLens 2 was released, my research and development teams got their hands on a couple of units, and I was able to try it out. This was my first experience with truly high-quality mixed reality: a somewhat wider field of view (FOV) than the first HoloLens and excellent image quality. While the hardware was very nice, it did feel a bit bulky and pricey. What was really noticeable, though, was that the old computer paradigm lived on inside the headset: virtual keyboards and questionable voice input methods. The headsets were great but really had to find their use cases. I believe those use cases were mostly in the industrial sector and not so much in retail, where I was leading R&D. I believe these units are now only sold to enterprises.
Google Glass
Perhaps you remember the hype around the Google Glass project, where a small prism display was placed in your field of view, projecting text and images into your eye. I tried it many times, and my teams had several units for development and research. There were a couple of clever features, like taking photos, getting directions, and discreetly receiving messages. While those features were fine, wearing the device over regular glasses was a bit awkward, and the overall interaction was not great.
Virtual Reality
Virtual reality (VR) has been in development since the late 1960s, with more attention during the 1990s, when both Nintendo and Sega explored VR gaming systems with little success. It took until the 2010s for technology with powerful graphics and motion tracking to come to market and for virtual reality to gain a foothold outside the tech community. The spike in recent years around the Metaverse helped grow this business, although we can now see sales cooling down again.
Personally, I own multiple VR headsets and have spent significant time, both professionally and privately, trying to understand these experiences. A very quick summary is that they are somewhat stuck in the desktop paradigm. Want to input something? Pull up the virtual keyboard. Want to interact with objects? Use the included multi-button controllers. The latest iterations include both eye and hand tracking, which help a lot, and most of these headsets now do an OK job of letting you interact with your hands as a complement to the controllers. Criticism often revolves around feeling isolated in the experience, being unable to easily share it with people around you, and not really knowing what’s going on outside the headset mounted on your face.
Augmented Reality
Augmented reality (AR) is exactly what it sounds like: it augments your view by overlaying information and nowadays can also analyze and interpret the environment in your field of view. If Google Glass was the seed, this is the flower. Many of the newer VR headsets also support AR with something called Passthrough. It’s as simple as it sounds: by passing through the video from the front cameras as the background of the VR experience (almost like a desktop wallpaper), you minimize the feeling of isolation.
Newer developments in AR include Meta’s Ray-Ban Stories project, where the use cases are slimmer, focusing both hardware and software on photo and video capture, phone calls, and live streaming. I see this as a clever bridge to their next iteration, which probably will include more overlaid interaction.
VOICE + AI + VISION = EXPERIENCE OF THE FUTURE
As you can see, technology has matured and is moving at a fast pace. In my analysis, the main challenge across almost all of these experiences is weak input and interaction methods. Virtual keyboards are painful. Voice input has been bad for at least the past 30 years.
This is where new developments in AI come in. In my work over the past several years, it’s been clear that the conversational paradigm is entering our world. Initially, it was chatbots that mostly annoyed us when we contacted customer support. Later on, ChatGPT was born, and the broader masses could see the potential of AI through simple conversations.
Critiques of the conversational paradigm often focus on its complexity. Why do I need to chat with an AI to return my goods when I used to be able to just press a button? Why did a full sentence on my keyboard replace a single click of a button? This doesn’t make things feel simpler; rather the opposite.
Then came leveraging context. We want you to write that sentence because it lets us build context around your return ticket: why are you returning it? Are you unhappy with the color, the size, something else? And the next time you need to return something, perhaps the AI can even advise you on your next order.
What comes next is the rise of real conversational interfaces, where AI-powered services like ElevenLabs, Google Voice, and Azure Realistic AI voice generator (I can go on forever) will enable use cases where voice interaction actually works—and works like a charm.
The future is conversational AI + Vision + Voice
Why conversational AI matters
Conversational AI, powered by large language models, has already shown great promise. Services like ChatGPT have demonstrated that AI can understand and generate human-like text, making interactions more intuitive and efficient. However, text-based conversation alone is not enough.
The role of vision & voice
Vision-based AI adds a missing dimension to this interaction. By integrating computer vision, AI can interpret and understand the visual context of the user’s environment. This means that AI assistants can see what you see, recognize objects, read text in the real world, and provide relevant information or assistance based on visual inputs.
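To make that concrete, here is a minimal sketch of “ask a question about what the camera sees”, assuming an OpenAI-style Python SDK and a vision-capable model. The model name, file names, and setup below are illustrative assumptions, not a recommendation of any specific product.

```python
# Minimal sketch: ask a vision-capable model about a single camera frame.
# Assumes the OpenAI Python SDK (pip install openai) and an API key in the
# OPENAI_API_KEY environment variable; the model name is illustrative.
import base64
from openai import OpenAI

client = OpenAI()

def describe_frame(image_path: str, question: str) -> str:
    """Send one image plus a question and return the model's answer."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4o",  # assumption: any vision-capable model would do here
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

# Example: point the "camera" (here, a saved frame) at a shelf.
# print(describe_frame("shelf.jpg", "What objects do you see, and is there any readable text?"))
```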
Voice interaction, enhanced by recent developments in AI-driven voice recognition and generation, provides a natural way to communicate with AI. Technologies like ElevenLabs, Google Voice, and Azure Realistic AI voice generator have made significant progress in creating realistic and responsive voice interfaces. Recently, OpenAI also released GPT-4o with voice as part of the experience, reducing latency by training a single new model end-to-end across text, vision, and audio, meaning that all inputs and outputs are processed by the same neural network.
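The classic way to build this is a three-stage loop: speech-to-text, a language model, and text-to-speech. The sketch below assumes the same OpenAI-style SDK as above purely for illustration; in practice each stage could come from a different vendor (ElevenLabs, Azure, and so on), and an end-to-end model like GPT-4o collapses the stages into one.

```python
# Sketch of one voice turn: speech-to-text -> language model -> text-to-speech.
# Model and voice names are illustrative assumptions; swap in your preferred vendor.
from openai import OpenAI

client = OpenAI()

def voice_turn(audio_path: str, reply_path: str = "reply.mp3") -> str:
    """Hear a spoken question, generate an answer, and speak it back."""
    # 1) Speech-to-text: transcribe the user's spoken question.
    with open(audio_path, "rb") as f:
        question = client.audio.transcriptions.create(model="whisper-1", file=f).text

    # 2) Language model: generate a reply to the transcribed text.
    chat = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": question}],
    )
    answer = chat.choices[0].message.content

    # 3) Text-to-speech: render the reply as audio and save it.
    speech = client.audio.speech.create(model="tts-1", voice="alloy", input=answer)
    with open(reply_path, "wb") as out:
        out.write(speech.content)
    return answer
```

The point of the end-to-end approach is that steps 1–3 stop being separate hops, which is exactly where the latency (and the robotic feel) of older voice assistants came from.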
The synergy of conversational AI, vision, and voice
The combination of these three technologies – conversational AI, vision, and voice – creates a powerful whole that will change how we interact with technology:
- Enhanced context awareness: AI can understand not only what you say but also the context in which you say it. For example, if you ask your AI assistant about a specific product while pointing your phone at it, the AI can provide detailed information about that product (see the sketch after this list). The potential with more sophisticated wearables is gigantic.
- Natural interactions: Conversational AI can interpret spoken commands and questions, making interactions more natural. Instead of typing or clicking, you can simply speak to your AI assistant, and it will respond intelligently based on what it sees and hears.
- Seamless integration: This integrated approach can be embedded in everyday devices like smartphones, AR glasses, and even home appliances, making AI assistance almost invisible and always accessible. I believe the future holds the promise of integrated wearable devices that will enable this seamless integration on an entirely new scale.
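As a rough sketch of that synergy (referenced in the context-awareness bullet above), the two fragments from the previous section can be wired into a single assistant turn: the spoken question supplies the intent, the camera frame supplies the context, and the answer comes back as speech. This reuses the assumed client and describe_frame from the earlier sketches, and the model and voice names remain assumptions.

```python
# Sketch: one multimodal assistant turn, reusing `client` and `describe_frame`
# from the earlier sketches (all model and voice names remain assumptions).
def assistant_turn(frame_path: str, audio_path: str, reply_path: str = "reply.mp3") -> str:
    """Hear a spoken question, look at the current camera frame, answer out loud."""
    # Intent: transcribe what the user just said.
    with open(audio_path, "rb") as f:
        question = client.audio.transcriptions.create(model="whisper-1", file=f).text

    # Context: answer the question against the current camera frame.
    answer = describe_frame(frame_path, question)

    # Response: speak the answer back to the user.
    speech = client.audio.speech.create(model="tts-1", voice="alloy", input=answer)
    with open(reply_path, "wb") as out:
        out.write(speech.content)
    return answer
```

On a phone or a pair of glasses, the frame would come straight from the camera and the audio from the microphone; the plumbing stays the same.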
Real-world use cases
Imagine walking into your kitchen and asking your AI assistant what ingredients you’re missing for a recipe. The AI can visually scan your pantry and fridge, recognize the items you have, tell you what you need to buy, and then order it for you. Or picture yourself in a retail store, asking your AI assistant for reviews and recommendations on a product you’re looking at. One of my favourite use cases was when my lab team scanned a competitor’s furniture piece and the model returned the equivalent model from our stores, comparing price, size, and reviews without pressing a single button.
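The only non-AI part of the kitchen example is pleasantly boring: once a vision model has listed what it can see, working out what is missing is a plain set difference. The ingredient names below are made up for illustration.

```python
# Tiny sketch of the "what am I missing?" step after the vision model has
# recognized the items in the pantry (item names here are invented).
recipe = {"eggs", "flour", "milk", "butter", "vanilla"}
recognized_in_pantry = {"flour", "butter", "salt", "coffee"}  # e.g. parsed from the model's answer

shopping_list = sorted(recipe - recognized_in_pantry)
print(shopping_list)  # ['eggs', 'milk', 'vanilla']
```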
Or why not scan your room and let the AI measure it and suggest which shelf and carpet to get, while recommending complementary items? The AI can instantly provide this information and even virtually place the objects in your field of view, making shopping more informed and efficient. Even the purchasing process could be streamlined into this flow.
Or why not have your AI assistant connect with the retail store’s AI agent, like a power-up in a game? Your personal stylist, interior decorator, and color expert. Retail stores will compete on the usual factors like price and availability, but also on how skilled their autonomous retail AI agent is.
While this seems futuristic, from a technology standpoint it can be done now, and we are already seeing glimpses of it in the wild.
Conclusion: The future of interaction
The future of interaction lies in the convergence of conversational AI, vision, and voice. These technologies will transform how we interact with the digital world, making it more intuitive, efficient, and integrated into our daily lives. Now is the time to explore and utilize these advancements to reap the benefits!