Issue #42: Will the Real Jason Please Stand Up? My Deepfake Experiment — Jason Michael Perry

Howdy 👋🏾,

Deepfakes are increasingly becoming a concern. In the last two weeks, my coworker Dean sent me an article about a Baltimore County principal whose comments were possibly faked with AI-generated audio, and a fake Biden asked people not to vote.

This whole thing got me wondering how easily I could fool people who know me with a deepfake of myself made using commercially available tools, especially with OpenAI announcing it can clone voices from just 15 seconds of audio.

For my choice of apps, I decided to pay for a subscription to ElevenLabs. This amazing text-to-speech tool provides tons of premium voices and allows voice replication from a few seconds of audio. I also explored RunwayML’s features, which generated a custom voice from an existing script. To get started with ElevenLabs, I recorded this audio clip, accepted the disclaimers, and let it rip.

Script: “Hi, my name is Jason Michael Perry, the man with three first names and the CTO of Mindgrub. I live in Baltimore, and I was born and raised in New Orleans. I can’t help but wonder if the tech I’m using is good enough to pretend to be me or not… we shall see.”

I have to say, the weirdest thing about this experiment is that I realized something I think most of us can identify with – I don’t like hearing my own voice, and I’m not actually sure I know what I sound like. One of the things I hate the most is editing videos of myself because I keep wondering who that dude talking is. Think about it – it’s like you hear your voice, but half is what you say, and the other half is in your head. So, to test, I needed to lean on people who are not me but know me to tell me if it sounds like me.

To start, I fed ElevenLabs my exact script and real voice, which generated this audio clip. Do you think this sounds like me?

The common agreement was that it does and it doesn’t. Some words it gets, some it pronounces very differently; it’s more like a hint of me that lacks the nuance to be a true replica. Even when it gets my voice right, it often misses something.

As a developer, my immediate thought is that it needs more data! I gave it a taste of my voice, but would a larger sample size get me closer to a real replica? To help, I turned to Anthropic’s Claude and asked it to generate a script that would give ElevenLabs more nuance to close the gap. This is the script Claude recommended; I read the whole thing aloud and uploaded the recording to ElevenLabs.

“Welcome to the voice training script. This text is designed to cover a wide range of sounds, intonations, and expressions to aid in creating a comprehensive voice model.

Let’s begin with some simple words: cat, dog, house, tree, sun, moon, car, book, pen, and computer.

Let’s move on to longer words: encyclopedia, revolutionary, mischievous, philanthropic, serendipity, quintessential, onomatopoeia, and diaphragm.

Next, we’ll cover a few phrases:

  • The quick brown fox jumps over the lazy dog.
  • She sells seashells by the seashore.
  • Peter Piper picked a peck of pickled peppers.
  • How much wood would a woodchuck chuck if a woodchuck could chuck wood?

Let’s express some emotions:

  • I’m so excited to be here today! (enthusiastic)
  • I can’t believe this is happening. (surprised)
  • I’m feeling a bit under the weather. (sad)
  • That joke was hilarious! (happy)

Now, let’s try different intonations:

  • Are you coming to the party tonight? (rising intonation for a question)
  • I can’t wait for the weekend! (exclamatory intonation)
  • I’m not sure if I can make it to the meeting. (uncertain intonation)

Lastly, let’s include some longer sentences:

  • The old man sat on the bench, feeding the pigeons as he reminisced about his youth.
  • The leaves rustled gently in the breeze, creating a soothing melody in the tranquil forest.
  • The astronaut marveled at the vast expanse of the universe, feeling both insignificant and inspired by its grandeur.

That concludes our voice training script. Thank you for your participation!”

By reading this script in your natural speaking voice, you’ll provide the AI with a diverse range of words, phrases, and expressions to learn from, helping it to better mimic your unique vocal characteristics.

With the new training audio uploaded, I ran the same text script through ElevenLabs a second time to see how well it did, and you can tell that the additional audio changed it from the last sample. So, is this me?

I wanted to broaden the experiment and see how this compares to RunwayML, which I trained to make images from my photos just last week. RunwayML is a bit different in that it gives me a script to read instead of letting me upload whatever audio I want, so the sampling data is not the same, but I can at least promise I used my voice to create it.

Of course, a good deepfake is more than good audio, and the next progression is to pair this with a video of me talking. To do this, I recorded a talking head video of me discussing exactly what I planned to do and reading the script.

This is a generated video of me with my fake voice using the above script:

This is where I learned a new lesson: videos with movement make it harder to create a quality deepfake. I’m sure this will change as the tech evolves, but the recommendation is that the video be pretty stable, with the head staying in the same place as much as possible and as little body and face movement as possible.

This is unsurprising – if you think of the many deepfakes of former Presidents, they tend to capture them at a podium with little body movement. I tend to move my hands as I speak, so I tried my hardest to keep my movement to a minimum and switched from an external camera to my Mac’s webcam for a tighter shot. I combined this with a script for a made-up weather report.

I’m also realizing that camera distance can help make a deepfake more believable. My videos are pretty close up, and that makes the focus on my lips more pronounced. I’ll try a different distance and report back in a future newsletter.

If you’re curious, you can also listen to the audio track generated to create this video on SoundCloud. The script was generated by Claude based on a prompt to make a script about it raining men.

How close is the AI Jason compared to the OG? I’d love to hear your thoughts, so reply or post a comment on LinkedIn. While you ponder deepfake Jason, here are my thoughts on tech & things:

⚡️A backdoor introduced into a commonly used open-source tool built into tons of software has raised fears about a less-considered vector of attack. The quick story is about a maintainer who, on his own time, maintains a piece of open-source software and is pressured to take on a co-contributor to apply patches and updates more quickly. It turns out that the new co-contributor was a state-sponsored operator who added a backdoor into the tool, compromising countless pieces of software that depend on this code. This write-up on the storyline gives a good nontechnical account of how it happened, but Ars Technica has a very in-depth piece if you want to know even more.

⚡️The first non-Apple App Store is nearing launch in the EU, and The Verge has a peek at the AltStore. I’m curious and hopeful that these stores see enough adoption to loosen Apple’s App Store guidelines and maybe provide an alternate means for apps Apple does not allow, like retro game consoles using ROMs.

⚡️Anthropic researchers find that a larger context window allows AI models to get better at answering questions over the length of a conversation – because they begin to understand the need better and respond better. That same ability also makes these models more willing to respond to inappropriate requests as the conversation continues. It’s like asking someone to do something illegal, and they say no, then you have dinner and talk for hours and ask again, and they’re more willing to help.

Before I started this process, I wondered how doing this might make me more susceptible to deepfakes. However, the more I experimented with these tools, the more I realized that anyone with even a small amount of video on social media has already provided more than enough material to do what I did with my own voice. These tools have become so good that you do not need that much video or audio for a first attempt.

Also, this is just a reminder that I launched a series of online and in-person workshops. I’m still getting things started, but I have a new course curriculum that focuses on a quick two-hour introduction to AI and then jumps into how you can use some of the same tools I use in my newsletter in your daily work.

-jason

p.s. NYC’s Mayor loves technology, but his love has led to him consistently striking out with rushed implementations. The most recent is an NYC chatbot that seems not to know city laws and regulations. Look, folks, AI chat can be amazing, but I can’t stress enough the importance of using RAG to augment the model’s answers with new and more current information, writing solid instruction sets to limit what it should and should not do, and testing. Test, test, and test again.
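
For the developers reading, here is a minimal sketch of the RAG-plus-instructions idea from the p.s., not how the NYC chatbot or any real product works. Everything in it is an assumption made for illustration: the `DOCUMENTS` entries are placeholder sentences (not real city rules), `INSTRUCTIONS`, `retrieve`, and `build_prompt` are hypothetical names, and the simple word-overlap scoring stands in for the embedding search a real system would more likely use. It only builds the prompt text; sending it to a model is left out.

```python
import re

# Hypothetical knowledge base: placeholder sentences, NOT real regulations.
DOCUMENTS = [
    "Placeholder rule A: food carts need a permit before operating.",
    "Placeholder rule B: landlords must give notice before entering a unit.",
    "Placeholder rule C: businesses must accept cash payments.",
]

# Hypothetical instruction set limiting what the bot should and should not do.
INSTRUCTIONS = (
    "Answer only from the context below. "
    "If the context does not cover the question, say you do not know."
)


def tokenize(text):
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(question, documents, top_k=1):
    """Return up to top_k documents sharing at least one word with the question."""
    question_words = tokenize(question)
    scored = [(len(question_words & tokenize(doc)), doc) for doc in documents]
    scored = [pair for pair in scored if pair[0] > 0]  # drop non-matches
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]


def build_prompt(question, documents=DOCUMENTS):
    """Combine instructions, retrieved context, and the question into one prompt."""
    context = retrieve(question, documents)
    context_text = "\n".join(context) if context else "(no matching context)"
    return f"{INSTRUCTIONS}\n\nContext:\n{context_text}\n\nQuestion: {question}"


if __name__ == "__main__":
    print(build_prompt("Do businesses have to accept cash?"))
```

The "test, test, and test again" part can start just as small: a handful of questions with known answers, plus a few the documents do not cover, checked every time the documents or instructions change.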