Experts have long warned of a future where artificial intelligence makes it impossible to tell digital fact from fiction. Now that future is here. A recent case of a recording that sounds like a high school principal making racist comments shows the risk that widely available generative AI tools can pose and the difficulty of detecting their use.
The offensive audio clip, which resembled a Baltimore County, Md., school principal’s voice, was posted on social media last week. It quickly spread online and then made local and national news. But the clip has not been verified, and a union spokesperson has claimed it was generated by AI, according to multiple media outlets. Baltimore County Public Schools is reportedly investigating the incident.
It’s not the first time the authenticity of a potentially damaging recording has come into question. Nor is it the first time someone has created a deepfake that has gone viral. But most instances have involved well-known public figures such as Russian president Vladimir Putin or U.S. president Joe Biden, not high school principals. Just this week a spate of robocalls in New Hampshire faked Biden’s voice to try to discourage people from participating in the state’s primary election. The recent explosion of generative AI means that more people than ever before have the means to create convincing fakes. Society may be woefully unprepared to deal with the inevitable wave of digital fraud that results and the looming implication that any media item could be fraudulent.
YEAR CATFISH spoke with Hany Farid, a computer science professor at the University of California, Berkeley, who studies digital forensics and media analysis. Farid has developed tools for deepfake detection that analyze audio, images and videos.
[An edited transcript of the interview follows.]
What are your thoughts on the Baltimore County Public Schools case?
This is such a fascinating story.
I have analyzed the audio with some of our tools, which aren’t yet publicly available. I think it is likely—but not certain—that this audio is AI-generated. Our [automated] model, trained to distinguish real from AI-generated audio, classifies this audio as AI-generated. I also [manually] analyzed the spectrogram of the audio, which, in five separate moments in time, shows distinct signs of digital splicing; this may be the result of several individual clips being synthesized separately and then combined.
Overall, I think the evidence points to this audio being inauthentic. But before making a final determination, we need to learn more.
What is the best way to find out if an audio recording is real or not? What would you hope happens in an investigation into a recording’s authenticity?
What I would like to see in any investigation is a multipronged approach. First, [investigators] should talk to multiple experts, and we should all do our analyses. Number two is, I think, we need to know more about the provenance of the content in question. Where was it recorded? When was it recorded? Who recorded it? Who leaked it to the site that originally posted it?
[If there are obvious signs of splicing or editing], I’d like the explanation for why. It could’ve been that there was a conversation unfolding, and somebody cut the audio to protect identities or shorten the clip. But another explanation might be that multiple AI snippets were cobbled together to make it sound like a sentence. AI generation tends to work better on short snippets than on long pieces.
How easy is it to create convincing audio deepfakes at this point?
It’s trivial. All you need is about a minute to two minutes of a person’s voice. There are services that you can pay $5 per month for [that let you] upload your reference audio and clone the voice. Then you can type and get convincing audio in a few seconds. This is text-to-speech.
There’s also a second way to do this called speech-to-speech. I record a person and clone their voice. And then I record myself saying what I want them to say with all the intonation—bad words and all—and it converts my voice into their voice. It’s all the same underlying generative AI technology.
For either method, anybody can do this. There is no barrier to entry or technical skill involved.
And how would you describe the skill level needed to identify AI-generated audio?
Very high. There’s a huge asymmetry here—in part because there’s a lot of money to be made by creating fake stuff, but there’s not a lot of money to be made in detecting it.
Detection is also harder because it’s subtle; it’s complicated; the bar is always moving higher. I can count on one hand the number of labs in the world that can do this in a reliable way. That’s disconcerting.
Are there any publicly available deepfake detection tools out there right now?
None that are reliable enough. I wouldn’t use them. The stakes are too high not only for individual people’s livelihoods and reputations but also for the precedent that each case sets. We have to adjudicate these things carefully.
Where do you see the future going with AI audio and other deepfakes?
Imagine that this Baltimore County incident is a story of a high school kid who got pissed off at their principal and did this—which is possible. Imagine that this threat now applies to every single teacher, principal, administrator and boss in the country. It’s not just the Joe Bidens and Scarlett Johanssons of the world at risk anymore. You no longer need hours and hours of someone’s voice or image to create a deepfake.
We knew this was coming. It wasn’t a question of if—it was when. Now the technology is here. But this isn’t just a generative AI story. This is a social media story. This is a mainstream media story. You have to look at the whole ecosystem here, which every one of us plays a role in. I’m annoyed by the media outlets that ran to publish the principal story without vetting the audio. We need to do better than that.
It’s getting harder and harder to believe what you read, see and hear online. That’s worrisome both because you are going to have people victimized by deepfakes and because there will be people who will falsely claim the “AI defense” to avoid accountability.
Imagine that every time this happens, we have to spend three days to figure out what’s going on. This doesn’t scale. We can do analysis in a few cases, but what happens when this is every day, multiple times a day? It’s worrisome.
We’ve been talking about the court of public opinion with this recent possible deepfake incident. But what about in actual legal cases? Is there any legal precedent for how audio and video are going to be authenticated moving forward in courts?
I do think we’re going to have to change how evidence is considered in a court of law. The good news is that in the actual courts—unlike on social media or with public opinion—there is dedicated time for analysis. I take some comfort in knowing that the judicial system moves slowly.
One big open legal question, though, is the responsibility these AI companies have to the public. Why can companies offer these AI services with essentially no guardrails? Deepfakes are not an unforeseen consequence of generative AI; this was clearly predictable. But up until this point, many companies have just decided their profits were more important than preventing harm. I think there should be some way to hold companies accountable. Maybe a person impacted by a deepfake should be able to sue the company behind the product that created it.
Liability isn’t a perfect system, but it has protected consumers from faulty and dangerous tech before. It’s part of why cars are so much safer now than in the past. I don’t think AI companies should get a free pass.