AIs and Robots Should Sound Robotic

Most people know that robots no longer sound like tinny trash cans. They sound like Siri, Alexa, and Gemini. They sound like the voices in labyrinthine customer support phone trees. And even those robot voices are being made obsolete by new AI-generated voices that can mimic every vocal nuance and tic of human speech, down to specific regional accents. And with just a few seconds of audio, AI can now clone someone’s specific voice.

This technology will replace humans in many areas. Automated customer support will save money by cutting staffing at call centers. AI agents will make calls on our behalf, conversing with others in natural language. All of that is happening, and will be commonplace soon.

But there is something fundamentally different about talking with a bot as opposed to a person. A person can be a friend. An AI cannot be a friend, despite how people might treat it or react to it. AI is at best a tool, and at worst a means of manipulation. Humans need to know whether we’re talking with a living, breathing person or a robot with an agenda set by the person who controls it. That’s why robots should sound like robots.

You can’t just label AI-generated speech. It will come in many different forms. So we need a way to recognize AI that works no matter the modality. It needs to work for long or short snippets of audio, even just a second long. It needs to work for any language, and in any cultural context. At the same time, we shouldn’t constrain the underlying system’s sophistication or language complexity.

We have a simple proposal: all talking AIs and robots should use a ring modulator. In the mid-twentieth century, before it was easy to create actual robotic-sounding speech synthetically, ring modulators were used to make actors’ voices sound robotic. Over the last few decades, we have become accustomed to robotic voices, simply because text-to-speech systems were good enough to produce intelligible speech that was not human-like in its sound. Now we can use that same technology to make AI speech that is indistinguishable from a human’s sound robotic again.

A ring modulator has several advantages: It is computationally simple, can be applied in real-time, does not affect the intelligibility of the voice, and—most importantly—is universally “robotic sounding” because of its historical usage for depicting robots.

Responsible AI companies that provide voice synthesis or AI voice assistants in any form should add a ring modulator of some standard frequency (say, between 30 and 80 Hz) and of a minimum amplitude (say, 20 percent). That’s it. People will catch on quickly.

Here are a few clips you can listen to that illustrate what we’re suggesting. The first is an AI-generated “podcast” of this article made by Google’s NotebookLM featuring two AI “hosts.” Google’s NotebookLM created the podcast script and audio given only the text of this article. The next two clips feature that same podcast with the AIs’ voices modulated more and less subtly by a ring modulator:

Raw audio sample generated by Google’s NotebookLM

Audio sample with added ring modulator (30 Hz, 25% depth)

Audio sample with added ring modulator (30 Hz, 40% depth)

We were able to generate the audio effect with a 50-line Python script generated by Anthropic’s Claude. One of the most well-known robot voices was that of the Daleks from Doctor Who in the 1960s. Back then robot voices were difficult to synthesize, so the audio was actually an actor’s voice run through a ring modulator. It was set to around 30 Hz, as in our example, with different modulation depth (amplitude) depending on how strong the robotic effect is meant to be. Our expectation is that the AI industry will test and converge on a good balance of such parameters, and will use better tools than a 50-line Python script, but this highlights how simple the effect is to achieve.
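For illustration, here is a minimal sketch of the effect along those lines. This is not the authors’ actual script; the soundfile dependency and the file names are assumptions:

```python
import numpy as np
import soundfile as sf  # assumed I/O library; any WAV reader/writer works

def ring_modulate(audio, sample_rate, carrier_hz=30.0, depth=0.25):
    """Blend the dry signal with a copy multiplied by a sine carrier.

    depth is the fraction of ring-modulated signal mixed into the
    output; depth=0.25 corresponds to the 25%-depth sample above.
    """
    t = np.arange(len(audio)) / sample_rate
    carrier = np.sin(2 * np.pi * carrier_hz * t)
    return (1.0 - depth) * audio + depth * (audio * carrier)

# Hypothetical usage; the file names are placeholders.
audio, rate = sf.read("podcast.wav")
if audio.ndim > 1:  # mix stereo down to mono for simplicity
    audio = audio.mean(axis=1)
sf.write("podcast_robotic.wav", ring_modulate(audio, rate), rate)
```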

Of course there will also be nefarious uses of AI voices. Scams that use voice cloning have been getting easier every year, but they’ve been possible for many years with the right know-how. Just like we’re learning that we can no longer trust images and videos we see because they could easily have been AI-generated, we will all soon learn that someone who sounds like a family member urgently requesting money may just be a scammer using a voice-cloning tool.

We don’t expect scammers to follow our proposal: They’ll find a way no matter what. But that’s always true of security standards, and a rising tide lifts all boats. We think the bulk of the uses will be with popular voice APIs from major companies—and everyone should know that they’re talking with a robot.

This essay was written with Barath Raghavan, and originally appeared in IEEE Spectrum.

Posted on February 6, 2025 at 7:03 AM • 34 Comments

Comments

A.I.nonymous February 6, 2025 8:48 AM

This technology will replace humans in many areas. Automated customer support will save money by cutting staffing at call centers. AI agents will make calls on our behalf, conversing with others in natural language.

Or conversing with other AI agents, including the ones in the call centers that are designed to frustrate consumers into giving up.

a.r.k. February 6, 2025 9:15 AM

My biggest issue with using a ring modulator is that those voices are rather… grating. I would get annoyed listening to them for any length of time. Why not just tone down the expressiveness of the voices to below average? They might sound like very dull humans, but at least it’d be tolerable and, after a couple of sentences, pretty easily recognizable as AI.

Arbol February 6, 2025 9:40 AM

I love the idea of making AI-generated voices sound like robots, but these samples don’t sound robotic; they sound over-compressed. I’ve experienced audio that sounds like this before, and it’s usually on low-quality compressed sources, like audiobooks. They don’t make me think of robots at all.

There are probably other effects that would evoke the idea of a robot, but this one isn’t very good.

wiredog February 6, 2025 10:05 AM

EX-TERM-I-NATE!

Sounds like they’re underwater, and my hearing is just bad enough that the third sample is almost too garbled to understand.

Me February 6, 2025 10:22 AM

I mean, this sounds like a good problem to solve; the issue, as I see it, is that this seems very much like the “evil bit”.

Sure, those that want to help us determine that they are speaking as an AI will be able to, but those that do not will just skip this.

Lite February 6, 2025 11:36 AM

Listening to the modulation with headphones actually kind of hurts. I hear a very deep rumble. I’m not sure if that’s the 30 Hz itself or an overtone of it, but I could only listen to part of each clip that way. What about a higher frequency but at a smaller amplitude?

Jordan Brown February 6, 2025 12:16 PM

I was recently thinking about something similar, and came up with a simple idea.

Make it criminal to release, or to deploy, an AI that lies when asked whether it is an AI. “Are you human?” “No.” “Are you a robot?” “Yes.” Et cetera.

There are special cases, but special answers might be acceptable. “I am an AI with a human monitoring this and other conversations.” “I am an AI translating for a human who is unable to speak this language.” Et cetera. As long as it accurately describes the situation, it’s probably OK.

Bad people will of course do bad things, but such a rule would address many of the concerns, and would provide an additional tool for prosecuting the bad people.

MB Watson February 6, 2025 1:06 PM

Just have the bot start the conversation by saying “I’m a bot, not a person. How can I help you?”

Not significantly different from the current “This call may be recorded for training purposes.”

Rj February 6, 2025 2:14 PM

The basic idea is good, but how about trying a balanced modulator, as in single sideband radio? It sounds more robotic, yet is easier to understand. I realize that 50 lines of Python was good enough to prove the concept, but it is probably not the best tonal effect.
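For instance, a frequency shifter built on the analytic signal would do it. A rough sketch, assuming NumPy and SciPy are available; the 50 Hz shift is just an illustrative value:

```python
import numpy as np
from scipy.signal import hilbert

def frequency_shift(audio, sample_rate, shift_hz=50.0):
    """Shift every frequency component up by shift_hz.

    Multiplying the analytic signal by a complex carrier and keeping
    the real part is the classic single-sideband (balanced modulator
    plus sideband filter) trick from radio.
    """
    analytic = hilbert(audio)  # complex analytic signal
    t = np.arange(len(audio)) / sample_rate
    return np.real(analytic * np.exp(2j * np.pi * shift_hz * t))
```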

C February 6, 2025 2:21 PM

I’m fine with making AI-generated voices sound funny, but I think it would be even better if they were legally prohibited from referring to themselves in the first person. There’s no “I” to say “I” there—and it doubles as a “this is AI” marker for anyone listening.

lurker February 6, 2025 4:05 PM

Have I got this right?
You have an AI that goes to a lot of trouble to make a realistic sounding voice, then you want it to add, post facto, some distortion so the voice no longer sounds realistic. This is
a) a waste of resources in generating the realistic voice in the first place, and
b) a rule that is made to be broken.

We made AI to generate sounds and images that we cannot distinguish from the “real” thing. Now we have to get used to that, it’s only gonna get better/worse.

cls February 6, 2025 10:28 PM

Good idea, to mark them and somehow signal that the voices are from an unreliable source.

Just make robot AI voices all monotone (no inflection) and clunky rhythm per syllable.

(and whatever the equivalent of no inflection is for tonal languages)

Everyone already understands that means “machine generated”.

Ring modulators are a thing in analog (music) synthesis, but no, that’s not how robotic voices were generated in the ’60s and ’70s.

Indeed, do not look to Hollywood for any guidance on this. Think back to the first Star Wars films. How did they represent the extreme long distance in some of the hologram connections? Yeah, by de-res and tearing of the blue hologram image, and drop outs, noise, and metallic ringing of the voices. Just like the audience knew to expect from their analog NTSC TVs of the age. We all knew what it “meant”.

And we were still astonished, so futuristic, imagining the technology to talk between star systems, not just by phone, or even video phone, but a 3D hologram phone! And imagining the cost, in those pre-breakup Ma Bell days. Ha ha

Sci Fi always represents the present.

Colin Ballast February 7, 2025 12:37 AM

You understand the dangers of this technology. Hopefully we can prevent it from delivering us into the hands of fascism. God Bless You Bruce Schneier.

R.Cake February 7, 2025 3:14 AM

Nice idea, but as others have stated, the ring modulator effect proposed here sounds like line distortion to me, much like a Teams call going bad just before the line quits on me.
If “we” (ah, that word again…) wanted to influence this issue, I think it might be more powerful to create a financial deterrent for the operators of AI voice generators. As proposed above, outlaw it for them to lie when you ask them if they are a human. But why stop there – why not outlaw it for them to lie at all? That would also help with AI bots cold calling citizens for political campaigns or selling insurance.

On the other hand, of course, with any such regulation you can only ever reach the official companies who at least make a residual effort at being a good citizen, or a citizen at all. The bigger part of the issue will come from gray-market or plain criminal actors anyway, and these are never kept at bay with laws and regulations. Sigh.

I guess, as they say in German, “the kid has already fallen down the well”.

Clive Robinson February 7, 2025 3:49 AM

@ Bruce, ALL,

With regards,

“An AI cannot be a friend, despite how people might treat it or react to it. AI is at best a tool, and at worst a means of manipulation.”

As I pointed out a while ago, the business plan for AI use in Microsoft, Google, etc can be said to be,

“Bedazzle, Beguile, Bewitch, Befriend, BETRAY”

As a way to obtain and use information ordinary people regard as in effect PRIVATE, Confidential, etc., and only let pass their lips or fingertips because of the manipulation by the “Directing Mind” behind the AI tool.

The Directing Mind employer then uses the information to make significant sums of money in ways few can easily comprehend, as the required mindset is in effect “alien to them”.

In short, the Directing Mind is looking for ways to abuse the mindsets that make normal humans “social” and thus form “societies”.

The one they have abused most obviously is the “human notion of trust”. As Prof Ross J. Anderson once pointed out, human trust is almost the opposite of what we call trust in security. A point that should be more widely known and understood, as it indicates why the big Silicon Valley corporations built on Internet use abuse people.

But a word about ring modulators and humans, two important points to note,

1, They increase the cognitive load
2, For some people with hearing issues, they make understanding the speech near impossible.

So there is a price to be paid for such measures, and you will get those with “AI Business Plans” using it as an argument against such protections (heck, do not be surprised if there is “Astroturfing” of organisations to achieve such ends). After all, you’ve seen the nonsense of the anti-5G persons who were burning down –or at least trying to– ordinary electricity pylons in the belief that they were being targeted by evil 5G operators.

As for Dr Who and the Daleks, they were a product of the BBC Radiophonic Workshop and it was not just electronic “sound effects” but video “visual effects” that originated from odd places inside “Auntie Beeb” 😉

But remember one of the Dr Who “creatives” was Douglas Adams, who gave us “The Hitchhiker’s Guide to the Galaxy”, which had “robots” and was originally just a radio show broadcast at some ridiculous hour of the day…

The “androids”, including Star Trek-style slushy doors, were described as having “Genuine People Personalities” and were thus “Your Plastic Pal who’s fun to be with”. The star of the show was “Marvin the Paranoid Android”[1].

Marvin’s speech is worth listening to because although it is not “robotic” in the sense of the Daleks and Cybermen of Dr Who, it is clearly sufficiently different to be recognised as not human because it has a subtle irritating quality[2] (as did all the other “Genuine People Personality” voices and the Ship’s Computer).

Marvin self-confesses to being an android with a GPP with the line “I’m a personality prototype, you can tell… can’t you?” after Arthur makes a comment about the irritatingly self-satisfied door that had just spoken.

So the use of a ring modulator is harsh on the ear and brain and will draw ire from some. However as Douglas Adams clearly demonstrated the use of such a harsh sound effect is really unnecessary.

And I also suspect Douglas has highlighted the fact that such AI androids, and similar will get “abused” by angry people both verbally and physically. After all there is that video of an office worker “defenestrating their computer” out of frustration.

[1] Marvin morphed in the various radio shows, TV show and stage plays as well as in the film,

https://en.wikipedia.org/wiki/Marvin_the_Paranoid_Android

[2] Whilst meant as a joke, there is a very real sociological difference with “Genuine People Personalities”, and it needs to be remembered that this preceded the modern notion of “Karens” and their like, who are somehow “human” but “apart from the human race” and recognisable because of it,

https://hitchhikers.fandom.com/wiki/Genuine_People_Personalities

Carl Breen February 7, 2025 3:53 AM

The overuse of filler words and “like uhm like literally”, often dubbed the California Valley accent (at least on Wikipedia), is worse for me personally than any other implication.

AI mimicking the flaws in regional speech behaviours immediately sounds dishonest and off-putting to me as a person.

I had to stop the player because it caused me discomfort. AI can sound realistic, but please make AI not sound psychopathic by trying to manipulate me as a listener.

AI should be humble and polite, but use the absolute minimum of “social” cues, so as not to feel as if it is trying to perform social engineering on me.

Hollywood did it right with Androids in the Alien franchise:

[in a neutral soft spoken curious tone] “Good morning Mister Schneier, how may I assist?”
Period.

A.I.nonymous February 7, 2025 5:15 AM

AI mimicking the flaws in regional speech behaviours immediately sounds dishonest and off-putting to me as a person. I had to stop the player because it caused me discomfort.

Welcome to the uncanny silicon valley.

fib February 7, 2025 6:40 AM

The arguments in this post are in line with those which propose to keep humanoid robots’ physical dimensions in check, and I agree with all of them.

‘https://www.mdpi.com/2078-2489/15/12/782

Regards.

Clive Robinson February 7, 2025 8:10 AM

@ fib,

I hope you are well; it’s certainly approaching freezing in the “Grate Grey Dampness” we call London. I’m hoping it’s nicer weather down your way.

With regards,

“The arguments in this post are in line with those which propose to keep humanoid robots’ physical dimensions in check”

Whilst I would not disagree with the idea there are a couple of issues,

1, Are you seeing all or some of the robot?
2, If not looking humanoid, what should we also not let them look like?

I could, for argument’s sake, via a little taxidermy and mechanics, make a very passable facsimile of something “cute n furry” like a racoon, squirrel or even somebody’s pet.

Because humans tend to anthropomorphize certain types of animate objects: first they give them assumed human traits, then they start to treat them as proto-human and give them “human trust”…

Let’s assume that even if the hardware for AI needed a dozen or more warehouses of 19” racks, a nuclear reactor or three for power, and a Niagara Falls quantity of water for cooling, it’s not an issue. Because big as that is, it’s not really relevant, as modern high-speed data communications can fit on a PCB smaller than your thumbnail, as many WiFi dongles demonstrate.

Thus in effect you are giving the AI “a face” that can be almost as small as you like.

People are already building drones that fly as small as some insects,

https://m.youtube.com/watch?v=H6q6pYZ9Fho

Further, the work on “Strandlopers” has shown they can be made not just efficient but self-sequencing, making “control by actuator” and similar power-hungry devices not that relevant.

Which makes the real limitation “energy used in time” the only realistic measure that you can go to court with.

Clive Robinson February 7, 2025 9:08 AM

@ Bruce,

Off topic but you probably want to know about it.

https://www.bbc.co.uk/news/articles/c20g288yldko

“The UK government has demanded to be able to access encrypted data stored by Apple users worldwide in its cloud service.”

Note the “worldwide”: this is in line with the unlawful requirements of “RIPA” and later UK legislation.

As far as the UK Security Services and much lesser persons such as the receptionist at your local “Town Council” are concerned they demand the right to trawl through your privacy without any kind of oversight or balance, no matter where you are in the world, even if you’ve never been in “UK controlled Areas”.

The UK Government view appears to be,

“If we can reach it from the UK we require access at all times.”

The problem is that there is no “unlawful behaviour” clause. That is, let’s say you live in a little village in deepest, darkest Elbonia, where power is generated by a “Donkey Mill” or similar. You have a computer that you keep turned off and locked up in a safe in your farmhouse basement along with your seasonal crop storage.

A person working for the UK Government unlawfully enters Elbonia, then illegally obtains access to your computer, illegally turns it on, and illegally connects it to, say, a “Hell-on-Rusk Star-fink” system. Then, despite the repeated unlawful/illegal activities of the UK Gov employee, as far as the UK Government via its legislation is concerned they have “full legal access”, and you cannot complain or seek restitution etc. Oh, and “resisting” them in any way makes you a criminal…

Agammamon February 7, 2025 12:45 PM

But there is something fundamentally different about talking with a bot as opposed to a person. A person can be a friend. An AI cannot be a friend, despite how people might treat it or react to it. AI is at best a tool, and at worst a means of manipulation. Humans need to know whether we’re talking with a living, breathing person or a robot with an agenda set by the person who controls it. That’s why robots should sound like robots.

  1. You can’t be a friend with a customer service rep at a call center. It’s just a business relationship.
  2. The rep has an agenda set by their employer. They may also have a hidden agenda of their own. The AI only has the former.
  3. I would say that people should not be ‘protected’ from AI by limiting the types of voices – AI and people both can be deceptive, and people should be encouraged to develop their own intuition as to when they’re being taken advantage of, whether it’s by a bot or a call-center worker.

By limiting the sort of voice an AI can have you only protect people from ‘honest’ corporations but leave them more vulnerable to criminals. I just do not see an advantage to having the ‘honest’ companies do this while accepting that criminals won’t and I see significant downsides to it.

Agammamon February 7, 2025 12:47 PM

I would also say, having listened to those samples, that they sound more like a bad connection than robotic.

Carl Breen February 7, 2025 1:27 PM

@A.I.nonymous

It’s got nothing to do with the uncanny valley for me. It’s simply like a car salesman: I have a sense for superficial charm. The same as with YouTubers who are all too obviously merely pretending to be my friend.

Uncanny would be a talking corpse. I often see the terminology wrongly attributed to other characteristics these days.

Clive Robinson February 7, 2025 6:19 PM

@ Bruce, ALL,

As some are aware I’ve some notions about current AI systems based on LLM and ML systems.

One of which is that they are little more than DSP systems implementing glorified “Matched Filters” or “Adaptive Filters” respectively.

Well it appears I’m not the only one with this view point,

https://m.youtube.com/watch?v=2LiCFdXf3M8

Is a talk given by somebody who works in the field, from a different aspect to most, describing current AI systems to those in the “Amateur Radio” community in Australia (VK land).

You will find not only are his opinions very much like mine, his practical example is “spot on” for what I write… But as it’s a “talk” it’s easier than reading my words 😉

Clive Robinson February 7, 2025 6:57 PM

@ Bruce,

With regards my posting above at “6:19 PM” it’s actually quite relevant to this topic.

Consider,

“Use AI to spot AI output.”

As I’ve pointed out in the past, LLM output is effectively “Noise plus Known Signal”, which is why the expression “Stochastic Parrot” arises.

Now consider the purpose of the “Noise” is to give a variable output based on a “filter Characteristic” probability (in effect an RMS over multidimensional vectors).

If we reduce the noise to zero we get the same signal as output each and every time. That is, it becomes not just “time invariant”, it becomes “fully deterministic”: the equivalent of a read-only database lookup.
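In LLM terms the “noise” is the sampling temperature. A toy sketch of the idea, in my own notation rather than any particular library’s API:

```python
import numpy as np

def sample_token(logits, temperature):
    """Temperature scales the 'noise': as it approaches zero the
    softmax collapses to argmax and the output becomes fully
    deterministic -- the read-only lookup described above."""
    if temperature == 0:
        return int(np.argmax(logits))
    probs = np.exp((logits - np.max(logits)) / temperature)  # stable softmax
    probs /= probs.sum()
    return int(np.random.choice(len(logits), p=probs))
```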

Now consider that to keep the hallucinations / hard bullshit down to tolerable levels the level of noise used has to likewise be quite low.

Hence the AI output will be more “Parrot” and less “stochastic”.

As the “parrot” comes from a “known dataset”, it can be matched and can therefore act as a “distinguisher”.

Thus an AI would be able to recognise another AI’s output on a common “known dataset”.

The thing is that it would not be too hard to “grab the output” of a “Suspected AI” and use that to form a subset of its “known dataset”. The fact that the Suspected AI will have a “Directed Purpose” will significantly reduce its output data set, thus making “matching” it a lot simpler.

So sending all Suspect AI output to an “AI Sleuth” service that builds a dataset will also produce distinguishers not just for the AI type but it’s probable intent.

If you had such an AI Sleuth running on your personal device it would quickly give you a “probable” score.

Even without you actually having to listen to the Suspect AI.

You could even get your own AI to “play along and time waste” the Suspect AI to the point that its costs are unfavourable to its “Directing Mind” operator.

ResearcherZero February 7, 2025 11:24 PM

Dear Artificial Intelligence chatbot,

Please answer the following questions in a robotic voice. Do a deep dive.

Will agents be allowed to pick and choose investigations?

What capabilities are we going to lose? What is the plan to replace that capability when they walk out the door?

‘https://abcnews.go.com/US/federal-government-buyouts-threaten-us-national-security-officials/story?id=118507244

[Hail, Sleet and Freezing Rain Warning]

Leave all “tips” in the shredder.

‘https://www.axios.com/local/washington-dc/2025/01/28/trump-executive-order-federal-workforce

reality matters!

https://theconversation.com/why-americans-need-well-informed-national-security-decisions-not-politicized-intelligence-analysis-248831

Great power competition and the risks of workforce attrition and alienation.
https://www.cnas.org/publications/commentary/managing-the-national-security-workforce-crisis

“it put the names of recent recruits in an unclassified email to the White House”

https://www.bloomberg.com/news/articles/2025-02-06/cia-list-of-new-recruits-risks-adversaries-exploiting-data

ResearcherZero February 7, 2025 11:51 PM

@Clive Robinson, @Bruce, ALL,

The machine did it! 😉 From now on we entirely use computers/AI/robots and missiles. Human sources be damned! If anything goes wrong we can simply blame it on our robot friends. 😐

Who? February 8, 2025 8:49 AM

I agree with a lot of people here: these examples sound more like the result of bad sampling than like a robotic voice.

On the other hand, applying those watermarks to AI-generated voices is a useless effort now. The technology used to clone voices exists, so criminals will use it. Period.

Once the Pandora’s box has been opened there is no way to close it again.

If researchers at universities want to improve AI-generated voice identification, they should work more on filters that identify the patterns in a voice that has not been created by a human being than on ways to make these voices sound less human. These filters should run on real-time videoconferencing tools and phone communications, ideally being implemented on the devices themselves so there is no need to “listen” to conversations at a central point. Identifying non-human voices instead of watermarking them is the only answer to this post-Pandora’s-box world.

Clive Robinson February 8, 2025 5:29 PM

@ ResearcherZero,

“… we can simply blame it on our robot friends.”

No friends of mine… I first started working with robots on the design side back in the 1980s.

Back then there were three types of robots,

1, Clock work toys for children.
2, Industrial arms for car manufacturing and the like.
3, Miniature arms for connecting to your 8bit home / school computer.

Of course I had played with the first when young 😉 But later I worked first with the big industrial arms, one of which was made by a company called Puma, and later designed quite a number of systems interfaces for miniature arms.

However, I still wake up some nights with memories of the Puma darn near killing me due to a programming error (a value that should have been an unsigned int got treated as a signed int, thus the arm did a full arc sweep at considerable velocity).

I knew then that robots were at best “cold calculating machines” and would never be anyone’s friend, especially if programmed by those that type rather than think…

The Zuckerberg / Facebook mantra of

“Move fast and break things”

Might sound good to some, but think about the implication if the thing that gets broken is your skull…

I guess a lesson Zuckerberg still needs to learn.

Rontea February 18, 2025 10:51 AM

Maybe we should be able to clearly distinguish between humans, machines, and magic.

I think that the modulator idea will increase transparency in AI interactions.

@Arbol
“There are probably other effects that would evoke the idea of a robot, but this one isn’t very good.”

It's a good start.

@Matthias Urlichs
“Those spectrum tag links are really annoying.”

I am glad I wasn't the only one thinking that.

@MB Watson
“I’m a bot, not a person. How can I help you?”

I have heard the phrase "automated assistant".

@Colin Ballast
“Hopefully we can prevent it from delivering us into the hands of fascism.”

"This machine kills fascism"

@Carl Breen
“AI can sound realistic, but please make AI not sound psychopathic by trying to manipulate me as listener.”

Manipulation? Credit, plan, signal
