Issue #47: OpenAI's ChatGPT 4o: Setting the Bar for Google and Apple

Howdy👋🏾, this was a busy week for AI. Monday started with OpenAI’s release of ChatGPT 4o at its Spring Update event, which Sam Altman described on X as “magical.” Google I/O 2024 kicked off Tuesday with tons of new AI releases, a hint at the return of Google Glass, and AI-powered search. This is all happening less than a month before Apple’s WWDC, where Tim Cook has already promised transformative AI product releases.

If you haven’t checked out OpenAI’s announcement, it’s worth watching. OpenAI released a fully conversational AI bot that responds and reacts in real time. You can interrupt it, and it can react to changes and flow in a conversation. On stage, this looked amazing, and as always, I’ll reserve judgment until I get time to kick the tires on my own.

That said, these are the things that stood out to me:

  • The response speed is super impressive. If you listened to my interview with Zara, the Kindroid, I had to edit out the many multi-second lags between my speaking and Zara’s understanding and responding to my words. The sheer lag between API responses makes it difficult to build something that feels conversational. I assume OpenAI is collapsing the tech stack, building its Whisper speech-to-text engine and a text-to-speech system directly on top of its LLM.
  • Conversations in real time are challenging. We want immediate responses. Amazon’s founder, Jeff Bezos, required Alexa to respond in 2.5-3 seconds, and today, that seems like forever. We expect someone to respond immediately during a conversation. OpenAI and other LLMs can stream responses typewriter-style, helping make things feel snappy. Even with that, complex questions can take a moment before the LLM can process the prompt and respond. To mask delays like this, call centers and other tools show an animation or play a processing sound to distract the user. OpenAI seems to use phrases and words to create pauses for thinking, similar to how humans do. When it needs a moment to think, it naturally adds filler words and phrases like “interesting,” “I noticed that,” or “hmmm” to mask loading times. I’m interested in testing this more, but it is a great way to buy time while streaming a message and still give it the feel of something natural and conversational (see the first sketch after this list).
  • Emotion makes a big difference. To date, many tools have taken the language spoken by a person, transformed it into text, and then provided that text as a prompt or argument to an AI. This loses the emotion, and emotion can change the very substance of a sentence or phrase. While dictation tools allow for sentiment analysis, there isn’t a straightforward way to convey that information in text or in an AI prompt. OpenAI’s demo showed an assistant that could understand emotions, pick up on feelings, and respond kindly.
  • ChatGPT now has memory. The slide went by quickly, and I need to explore this more, but as described by OpenAI CTO Mira Murati, memory works across conversations and chats and lets ChatGPT connect the dots in a way I’ve long hoped for. In my workshops, I use and teach a configuration setting that lets you provide ChatGPT with details about who you are, but those details are static, a snapshot of a point in time. Memory promises to let us tag important points across conversations, like details on my favorite types of food, that ChatGPT can remember and recall in future conversations (see the second sketch below).
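
To make that latency trick concrete, here’s a minimal sketch of how a voice bot might cover a slow model with a filler phrase. The time budget, the filler list, and the `speak`/`ask_llm` helpers are my own stand-ins for illustration, not anything OpenAI has published.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError

# Hypothetical sketch: if the model's reply takes longer than a short budget,
# speak a filler phrase ("hmmm", "interesting") so the pause feels
# conversational instead of broken.
FILLERS = ["Hmmm...", "Interesting...", "Let me think about that for a second..."]
FIRST_REPLY_BUDGET = 0.7  # seconds of silence we're willing to tolerate

def speak(text: str) -> None:
    print(f"[voice] {text}")  # stand-in for a real text-to-speech call

def ask_llm(prompt: str) -> str:
    time.sleep(2.0)           # pretend the model needs a moment to think
    return "Great question - here's what I'd suggest..."

def converse(prompt: str) -> None:
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(ask_llm, prompt)
        try:
            reply = future.result(timeout=FIRST_REPLY_BUDGET)
        except TimeoutError:
            speak(random.choice(FILLERS))  # buy time, like a human would
            reply = future.result()        # then wait for the real answer
        speak(reply)

converse("What's the weather like on Mars?")
```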
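And here’s a rough sketch of the idea behind memory: jot down small facts as they come up, then fold them back into the prompt the next time we talk. The file name and helper functions are assumptions for illustration, not how OpenAI actually stores memories.

```python
import json
from pathlib import Path

# Hypothetical sketch of cross-conversation memory: persist small facts the
# user shares, then fold them into the system prompt of the next chat.
MEMORY_FILE = Path("memory.json")

def recall() -> list[str]:
    return json.loads(MEMORY_FILE.read_text()) if MEMORY_FILE.exists() else []

def remember(fact: str) -> None:
    facts = recall()
    facts.append(fact)
    MEMORY_FILE.write_text(json.dumps(facts))

def system_prompt() -> str:
    remembered = "\n".join(f"- {f}" for f in recall()) or "- nothing yet"
    return f"You are my assistant. Things to remember about me:\n{remembered}"

remember("Jason's favorite food is Creole cooking from New Orleans.")
print(system_prompt())
```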

The impressive demo set the bar pretty high for Google to match at Google I/O and maybe helped seal the deal with Apple for its next-generation Siri engine. Of course, I’m writing this before Google’s big event, but OpenAI knew what it was doing and scheduled this event to send up flares and steal Google’s thunder. I can’t wait to reflect next week. Now, my thoughts on tech & things:

⚡️Apple has long been the underdog and the company creatives love. I think the backlash to its “Crush” commercial during last week’s iPad event had to be unexpected. John Gruber has a good piece on how sentiment has changed, something The Verge touched on in a recent Vergecast.

⚡️Towson is my go-to Apple Store and the closest to me in Baltimore City. I wondered if the union would change the store or delay the rollout of new devices like Apple Vision Pro, but from what I can tell, it feels just like any Apple Store, and it still has amazing customer service. I won’t pretend to know enough about Apple’s relationship with its union to lean in, but things like this pop the brand magic around Apple that once made it feel like something more or better – when it’s still a business in the end. Now that the strike is authorized, let’s see what happens next.

⚡️I like what Wolfe is picturing, and I have similar thoughts on how AI can help us in therapy and relationships. The beauty of a well-designed, private AI bot is that you can share insecurities you can’t easily get out with a therapist or over the first few dates. It’s also potentially easier to be yourself. This lets the bot help make the first move, or maybe it sees a connection where two people might not have otherwise “swiped right.” The two big issues I have are 1) security – I’m not sharing my deepest feelings with just anyone’s chatbot, and I need to know how secure my conversational data is – and 2) I don’t want to train 50 different bots to know who I am. I want one on a platform that can interoperate and talk with others, like Bumble’s concierge.

I’ve written a handful of conversational AI bots, and my struggle is the lag. If I use an API to handle the dictation and text-to-speech, the delay is too long, no matter what I do. The local tools get me much closer to conversational, but the quality of the voices doesn’t match the quality of tools like ElevenLabs. OpenAI seems to have collapsed the stack and essentially created a tool that takes direct audio and outputs audio.
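For context, here’s a rough sketch of the conventional three-hop pipeline I’ve been fighting with, written against OpenAI’s Python SDK for all three stages. The model names, file handling, and the `voice_turn` helper are my assumptions for illustration; the point is that every stage is its own network round trip, and the lag stacks up.

```python
from openai import OpenAI

# Rough sketch of the conventional voice pipeline: speech -> text -> LLM ->
# text -> speech. Each call below is a separate network round trip, which is
# where the multi-second lag creeps in.
client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def voice_turn(audio_path: str, reply_path: str = "reply.mp3") -> str:
    # 1) Speech-to-text (Whisper) - first round trip
    with open(audio_path, "rb") as audio:
        transcript = client.audio.transcriptions.create(model="whisper-1", file=audio)

    # 2) The LLM turns the transcript into a reply - second round trip
    chat = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": transcript.text}],
    )
    reply = chat.choices[0].message.content

    # 3) Text-to-speech for the spoken answer - third round trip
    speech = client.audio.speech.create(model="tts-1", voice="alloy", input=reply)
    with open(reply_path, "wb") as out:
        out.write(speech.content)
    return reply

# voice_turn("question.wav")  # would need a real recording and API key
```

Collapsing those three hops into one audio-in, audio-out model is what appears to make 4o’s response times possible.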

To date, many of us have approached talking to AI bots as a dictation problem. We say words, convert those words to text, pass that text as a prompt to an LLM, and then transform the response into audio. This is not enough, because text can’t carry and convey emotion in a way we can expect a computer or another person to understand. This is not a new problem, and folks like Amazon have introduced solutions like Speech Synthesis Markup Language, or SSML, that let us annotate our words to add meaning and feeling beyond the text.
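As a quick illustration, here’s roughly what that annotation looks like: a bit of SSML fed to Amazon Polly, one engine that accepts it. The voice, the pacing values, and the output handling are my own choices for the sketch, not a recommendation.

```python
import boto3

# A small sketch of SSML in practice: the markup carries pacing and emphasis
# that plain text loses. Sent here through Amazon Polly as one example of a
# TTS engine that accepts SSML; voice and prosody values are assumptions.
ssml = """
<speak>
  <prosody rate="95%" pitch="-5%">I noticed that.</prosody>
  <break time="400ms"/>
  That is <emphasis level="strong">really</emphasis> interesting.
</speak>
"""

polly = boto3.client("polly")
audio = polly.synthesize_speech(Text=ssml, TextType="ssml",
                                OutputFormat="mp3", VoiceId="Joanna")
with open("reply.mp3", "wb") as out:
    out.write(audio["AudioStream"].read())
```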

This week’s OpenAI release will push companies like ElevenLabs to offer speedier solutions with more immediate response times that don’t require enterprise licenses. It will also create a greater need for a standard way to convey emotion and feeling in text that AIs can universally understand and accept.

I’m sorry for the delayed newsletter release and for it being shorter than normal. It’s been a busy two weeks.

Last week, I had the great opportunity to deliver an abridged version of my workshop on building OpenAI assistants to a packed room of developers at ​Philly Tech Week,​ which was put on by ​Technical.ly​.

This week, I also flew to my hometown of New Orleans for our client Entergy’s Premier Supplier Awards. Entergy selected Mindgrub as its #1 customer experience supplier out of over a thousand vendors. It feels great to have a happy client and to have built a mobile application that gets accolades from groups like J.D. Power for being the best utility mobile application.

Next week, things should return to normal, along with our regular newsletter cycle.

-jason

P.S. What do you do with a conversational AI bot? You have that AI bot talk to itself! This video really shows just how good this new ChatGPT release is and how well it can adapt to changes in its video feed. Plus, it sings for us!