Multimodality, Google I/O, Project Astra – Let’s go!
It’s that time of the year again, the time for big tech conferences! And Google I/O never fails to deliver. What always gets me excited is not just the unveiling of new tech, but seeing how these innovations are applied. In fact, back in 2019, one of the projects my specialized Edge Engineering team worked on was actually showcased in the Google I/O keynote!
Now, if you’ve been following my blog, you know I’ve been diving deep into the world of vision and multimodality for the past five years. My teams have been experimenting with everything from object detection and real-time tracking to object visualization and beyond. While my day-to-day role as a VP of Engineering/Technology with hundreds of direct reports has spanned much broader technology, I have (almost) always kept a team working on innovation in a particular way, one I have developed and refined over the years.
Google I/O was packed with interesting announcements, but one that really caught my eye was Project Astra. This multimodal assistant (meaning it can handle video, voice, text, and more) operates entirely within the viewfinder of your device. In the demo, they cleverly overlay UI elements and text onto a regular camera viewfinder to interact with Astra.
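Astra’s interface is no doubt far more sophisticated, but the basic “draw the assistant’s answers on top of the viewfinder” pattern is easy to prototype. Here’s a minimal sketch using OpenCV; `detect_objects` is a hypothetical stand-in for whatever detector you plug in (ML Kit, a cloud API, your own model), not anything Astra actually uses.

```python
# Minimal sketch: draw detection labels on top of a live camera feed.
# Requires `pip install opencv-python`. detect_objects() is a hypothetical
# placeholder for your detector of choice.
import cv2

def detect_objects(frame):
    # Placeholder: return a list of (label, (x, y, w, h)) tuples.
    return []

cap = cv2.VideoCapture(0)  # default camera
while True:
    ok, frame = cap.read()
    if not ok:
        break
    for label, (x, y, w, h) in detect_objects(frame):
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
        cv2.putText(frame, label, (x, y - 8),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)
    cv2.imshow("viewfinder", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):  # press q to quit
        break
cap.release()
cv2.destroyAllWindows()
```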
Over the years, I’ve witnessed the commoditization of many of the services needed to build these kinds of experiences. It’s getting easier! You can now leverage native cloud services, like those on Google Cloud or Microsoft Azure (and other providers), to identify objects in near real time. Native AR features are baked into both iPhone and Android devices. The list goes on.
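To show just how little code “identify objects in near real time” takes today, here’s a minimal sketch against Google Cloud Vision’s object localization feature (the Azure Computer Vision API linked at the bottom works much the same way). It assumes you have a Google Cloud project with the Vision API enabled and credentials configured; treat it as a sketch, not production code.

```python
# Minimal sketch: detect objects in a single frame with Google Cloud Vision.
# Assumes `pip install google-cloud-vision` and GOOGLE_APPLICATION_CREDENTIALS
# pointing at a service-account key for a project with the Vision API enabled.
from google.cloud import vision

def localize_objects(image_path: str):
    client = vision.ImageAnnotatorClient()
    with open(image_path, "rb") as f:
        image = vision.Image(content=f.read())
    response = client.object_localization(image=image)
    for obj in response.localized_object_annotations:
        print(f"{obj.name} (confidence {obj.score:.2f})")
        # Normalized [0,1] polygon you can map back onto the viewfinder.
        for v in obj.bounding_poly.normalized_vertices:
            print(f"  ({v.x:.2f}, {v.y:.2f})")

if __name__ == "__main__":
    localize_objects("frame.jpg")  # e.g. a frame grabbed from the camera
```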
I think this particular demo is brilliant, and I want to deconstruct some of it here.
This demo struck me as not only very cool but also feasible for some of my former teams to build. What really blew me away was when the presenter asked Astra if it knew where she had left her glasses… and it did! This use case is a no-brainer: detect objects, track their location, add a voice interface and AI assistant capabilities. We’ve done all of these things before, just not in this specific combination or with this level of object location and “intent” memory.
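To make the “where are my glasses” trick feel less like magic, here’s a hypothetical sketch of just the memory part of that recipe: every detection gets written into a small last-seen store, and the question becomes a lookup. All the names here (ObjectMemory, record, where_is) are my own illustration, not anything from Astra.

```python
# Hypothetical sketch of the "where did I leave X?" memory layer:
# each detection is stamped with a time and a location description,
# and answering the question is just a lookup of the latest sighting.
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Sighting:
    label: str
    location: str          # e.g. "on the desk, next to the red apple"
    seen_at: datetime

class ObjectMemory:
    def __init__(self):
        self._last_seen = {}

    def record(self, label: str, location: str) -> None:
        self._last_seen[label.lower()] = Sighting(label, location, datetime.now())

    def where_is(self, label: str) -> str:
        sighting = self._last_seen.get(label.lower())
        if sighting is None:
            return f"I haven't seen your {label} yet."
        return (f"I last saw your {sighting.label} {sighting.location} "
                f"at {sighting.seen_at:%H:%M}.")

# Usage: feed it detections as they stream in, then ask.
memory = ObjectMemory()
memory.record("glasses", "on the desk, next to the red apple")
print(memory.where_is("glasses"))
```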
What’s truly exciting to me is that most, if not all, the services needed to create something like this are now publicly available. So what are you waiting for?
While I haven’t read up on Project Astra’s tech, I believe they are using Google Lens technology in combination with Gemini to achieve this. As it’s experimental right now, I encourage you to try building parts of it with services that are already available. You’ll be in a great position when Astra’s tech becomes more broadly available.
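That’s pure speculation on my part about Astra’s internals, but you can already glue similar pieces together yourself. Here’s a hedged sketch that hands a list of detections to Gemini through the public google-generativeai SDK and asks it a question about the scene; the model name and prompt are my assumptions and will likely need adjusting.

```python
# Sketch: hand a list of object detections to Gemini and let it answer
# questions about the scene. Assumes `pip install google-generativeai`,
# a GOOGLE_API_KEY environment variable, and that the model name below
# is still available; adjust as needed.
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-flash")

detections = [
    {"label": "glasses", "location": "on the desk, next to the red apple"},
    {"label": "speaker", "location": "on the shelf by the window"},
]

prompt = (
    "You are an assistant watching through a camera. "
    f"Recent sightings: {detections}. "
    "Question: do you remember where I left my glasses?"
)

response = model.generate_content(prompt)
print(response.text)
```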
Most companies don’t realize how much capability is available at the touch of a button. I see many use cases where “assistants” like Astra could grow sales, customer engagement, and loyalty.
Links to learn more and get inspired:
The Verge’s take on Project Astra announcement – https://www.theverge.com/2024/5/14/24156296/google-ai-gemini-astra-assistant-live-io
Engadget – https://www.engadget.com/google-project-astra-hands-on-full-of-potential-but-its-going-to-be-a-while-235607743.html
Google ML Kit object detection and tracking – https://developers.google.com/ml-kit/vision/object-detection
Google object detection learning pathway – https://developers.google.com/learn/pathways/get-started-object-detection
Azure object detection – https://learn.microsoft.com/en-us/azure/ai-services/computer-vision/concept-object-detection
Experimentation: LLM, LangChain Agent, Computer Vision – https://teetracker.medium.com/experimentation-llm-langchain-agent-computer-vision-0c405deb7c6e
So, what does the future hold for multimodal experiences like this?
I’m already writing my next blog post, where I’ll explore the emerging world of conversational interfaces, voice, and video, and why these technologies are unfolding in a particular sequence right in front of our eyes.