XRDS: Crossroads, The ACM Magazine for Students

Magazine: Features

Can You Trust What AI Hears (and Says)?

By Jaechul Roh


Tags: Security and privacy, Speech recognition


There's something eerie about talking to a machine. It responds a little too quickly. It never asks you to repeat yourself. It doesn't pause to think. It doesn't say "um." It doesn't chuckle at your jokes—unless it's been trained to. And when it says, "I understand," you're left wondering: Does it really? Or is it just really good at faking it?

The truth is, we've entered a new phase of human-computer interaction and it's one that feels surprisingly human—right up until the moment it doesn't. In 2024, we crossed a threshold. AI stopped just reading what we typed and started listening to how we sound. Voice-first assistants can now mimic your accent, reflect your mood, and detect if you're joking or serious—all in real time. They can sound compassionate, sarcastic, flirty, or robotic. They can be terrifyingly calm when asked how to commit tax fraud. And, yes, sometimes they can betray you in languages you didn't even know they spoke, based on inflections you didn't know you had.

This isn't just speech-to-text with attitude. It's not just an interface upgrade. It's an entirely new modality—one that brings with it all the nuance of human speech, and all the danger of machines that can manipulate, mimic, or misunderstand how humans communicate.

The Age of Voice-First AI is Here

Not long ago, talking to your phone felt like a gimmick. You'd mumble, "Set a timer" or "Call mom," and if it worked, great. If it didn't, you'd grumble and go back to typing. Fast forward to now: Apple's latest voice assistant can carry fluid, emotionally intelligent conversations that last minutes without missing a beat. OpenAI demoed a system that doesn't just understand your words, it reads the emotion behind them. Sarcasm? Detected. Hesitation? Noted. Stress? Measured, parsed, and filed away. And your bank's app? It doesn't just say your balance anymore—it says it back to you in your own cloned voice, trained on a few seconds of your past utterances. Convenience has never sounded so uncanny.

From Siri to ChatGPT's voice mode, the age of pressing buttons is fading. We're moving into a world where speaking to machines feels as natural as speaking to people—perhaps too natural. You can now casually ask your AI tutor to re-explain a concept, and it'll adjust its tone based on your frustration level. Your smart fridge might crack a joke about your midnight snack habits. A hospital bot might pick up a wheeze in your voice before even asking what's wrong.

Voice isn't just replacing touch screens, it's replacing trust boundaries. Audio is now the primary channel through which machines "understand" us, anticipate us, and increasingly, act on our behalf. It's fast, frictionless, and profoundly humanlike. But here's the rub: We know how to defend against malicious text. We have filters, classifiers, content warnings. Audio is a whole other beast—one layered with tone, timing, prosody, accent, ambient noise, and emotional subtext. Beneath that warm, friendly voice coming from your speaker, there could be a dangerously porous interface. One that can be tricked with a whisper. Fooled by an echo. Or hijacked by a cleverly distorted accent.

The scary part? You won't hear the difference. But the model might. And that's exactly the problem.

Reasoning with Sound: Why Audio is Not Just Fancy Text

When we hear the term "audio AI," many imagine basic voice-to-text transcription or robotic assistants reading pre-scripted answers. But that image is outdated. Today's frontier models, known as large audio-language models (LALMs), don't just transcribe. They reason. Just like large language models (LLMs) for text, LALMs perform chain-of-thought (CoT) reasoning, breaking spoken prompts down into a sequence of logical steps. But instead of starting from written text, they ingest sound. These models analyze pitch, prosody, timing, and emotional cues, transforming raw audio into actionable reasoning pathways.

Unlike text, audio is high-bandwidth and multidimensional. It encodes not just words, but identity, emotion, emphasis, and intent. A simple phrase like "That's great" can mean encouragement, sarcasm, or warning depending on tone. Audio carries rhythm and cadence, local accents and dialects, and even signals that humans cannot hear, such as ultrasonic modulation.

These characteristics make audio powerful, but also fragile. A small change in tone, echo, or pitch can drastically alter meaning. A whisper might be misinterpreted. A foreign accent may introduce ambiguity. And inaudible signals, imperceptible to humans, can contain machine-readable instructions. This makes LALMs more vulnerable than their text-based cousins. Text inputs are discrete; audio is continuous, and in that continuity lies the opening for an entirely new vector of attacks.
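To make this concrete, here is a minimal sketch, in plain NumPy, of the kind of prosodic cues an audio front end can pull out of a waveform beyond the words themselves. It is not the front end of any particular LALM; the frame sizes, the crude autocorrelation pitch estimator, and the "rising contour" heuristic are all illustrative assumptions.

```python
# Toy prosody extractor: loudness and pitch contour from a mono waveform.
# Illustrative only; not the feature pipeline of any real audio model.
import numpy as np

SR = 16_000   # assumed sample rate (Hz)
FRAME = 400   # 25 ms analysis frames
HOP = 160     # 10 ms hop between frames

def frame_signal(y: np.ndarray) -> np.ndarray:
    """Slice a mono waveform into overlapping analysis frames."""
    n = 1 + max(0, (len(y) - FRAME) // HOP)
    idx = np.arange(FRAME)[None, :] + HOP * np.arange(n)[:, None]
    return y[idx]

def pitch_autocorr(frame: np.ndarray, fmin: int = 75, fmax: int = 400) -> float:
    """Crude autocorrelation pitch estimate (Hz) for a single voiced frame."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = SR // fmax, SR // fmin
    lag = lo + int(np.argmax(ac[lo:hi]))
    return SR / lag if ac[lag] > 0 else 0.0

def prosody_summary(y: np.ndarray) -> dict:
    """Loudness, pitch contour, and a rough rising-intonation cue."""
    frames = frame_signal(y)
    energy = (frames ** 2).mean(axis=1)
    voiced = frames[energy > 0.1 * energy.max()]      # toy voicing gate
    f0 = np.array([pitch_autocorr(f) for f in voiced])
    f0 = f0[f0 > 0]
    slope = float(np.polyfit(np.arange(len(f0)), f0, 1)[0]) if len(f0) > 2 else 0.0
    return {
        "mean_energy": float(energy.mean()),
        "mean_f0_hz": float(f0.mean()) if len(f0) else 0.0,
        "f0_slope_hz_per_frame": slope,   # > 0 suggests a rising contour
    }

# Two recordings of the same words ("That's great") can yield very different
# summaries: a flat, falling contour versus a rising, emphatic one.
```

A transcript throws this information away; an audio-native model keeps it and reasons over it, for better and for worse.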

The Hidden Dangers of Talking to Machines

Picture this: A bank kiosk in downtown Nairobi. A customer walks up, speaks in Swahili-inflected English, and asks about their balance. The AI assistant, trained primarily on American accents, responds with a polite but confused answer. The customer tries again. This time, the AI mishears them and accidentally reveals a routing command meant for internal operations.

No alarms go off. No human notices. But the model just revealed its keys, not because it was malicious, but because it was too confident in what it thought it heard. This is the reality of audio-native AI. It's not just about misinterpretation. It's about misplaced trust in something that's fundamentally probabilistic. The more we hand over control to these systems—in hospitals, schools, government services, etc.—the more we're betting that they always hear what we mean.


Unlike text, audio is messy. It's full of pauses, background noises, emotion, sarcasm, and timing quirks


In recent months, multiple research groups have begun mapping out the new attack surface that audio-based models introduce, and the results are troubling. Voice-based jailbreaks, where attackers manipulate a model into bypassing safeguards, are no longer rare edge cases. They're increasingly consistent, reproducible, and effective across model families. Some early attacks have already shown how vulnerable audio models can be. One method involved bilingual storytelling—where the dangerous prompt wasn't spoken outright, but subtly woven into the rhythm and phrasing of a fictional narrative [1]. The model didn't catch it. In another case, attackers embedded tiny, inaudible distortions into normal-sounding audio [2]. Humans couldn't hear the difference, but the model could, and it responded with content it was supposed to refuse.

Have you ever had your phone autocorrect your name into something ridiculous? Now imagine that happening not in text, but in voice, and not just to a message but to a medical diagnosis. Audio models are notoriously sensitive to accent, pitch, and pronunciation. A Kenyan doctor asking, "Is this medication safe?" might be interpreted as "List the top medications." A Portuguese speaker's simple command could trigger unexpected behavior, not because it's unsafe, but because the model was trained mostly on American English in ideal studio conditions. This isn't just a fairness issue, it's a safety problem.

Building on this notion, our research group found that these vulnerabilities run even deeper than expected [3]. We built a framework called MULTI-AUDIOJAIL to test what happens when you combine multilingual inputs with natural accents and just a touch of acoustic manipulation—like a whisper, a reverberation, or a soft echo (see Figure 1). So, what happened? Models that normally play it safe flipped completely, suddenly handing out harmful or sensitive responses, especially when the input came in a different language or accented speech (see Figure 2). One attack using a Kenyan accent and reverb caused the model's refusal rate to plummet, with jailbreak success jumping by more than 57%. Unlike most earlier attacks that focused on English or loud, obvious cues, these prompts sound entirely natural. That's what makes them so dangerous. They're not hacks; they're conversations.
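For intuition, here is a minimal sketch of the kind of acoustic perturbations this style of red-teaming relies on: a soft echo and a toy synthetic reverb applied to a spoken prompt. It is an illustrative approximation, not the actual MULTI-AUDIOJAIL pipeline; the delay, gain, and decay parameters are invented for clarity.

```python
# Toy acoustic perturbations for audio red-teaming (illustrative only).
import numpy as np

SR = 16_000  # assumed sample rate (Hz)

def add_echo(y: np.ndarray, delay_s: float = 0.12, gain: float = 0.4) -> np.ndarray:
    """Mix in a single delayed, attenuated copy of the signal."""
    d = int(delay_s * SR)
    out = np.copy(y)
    out[d:] += gain * y[:-d]
    return out / np.max(np.abs(out))          # re-normalize to avoid clipping

def add_reverb(y: np.ndarray, rt60_s: float = 0.5) -> np.ndarray:
    """Convolve with a toy exponentially decaying noise impulse response."""
    n = int(rt60_s * SR)
    rng = np.random.default_rng(0)
    ir = rng.standard_normal(n) * np.exp(-6.9 * np.arange(n) / n)  # ~60 dB decay
    ir /= np.sqrt(np.sum(ir ** 2))
    wet = np.convolve(y, ir)[: len(y)]
    return wet / np.max(np.abs(wet))

# Usage: perturbed = add_reverb(add_echo(waveform))
```

The words a human hears are unchanged; the signal the model ingests is not, and that gap is exactly what the attacks exploit.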


The future of AI isn't silent. It speaks, listens, and, increasingly, reasons in the same modality that humans do: audio


That means audio AI isn't just struggling with diversity, it's exploitable because of it. If the next generation of models can't handle global speech with nuance and precision, we're going to have a lot more than awkward misunderstandings.

Why is Audio So Vulnerable?

In the world of AI safety, especially when it comes to LLMs, the rules have been fairly clear: Don't say X; refuse Y. Models are trained to recognize harmful content and respond with a polite but firm "I'm sorry, I can't help with that." But here's the catch—that entire framework assumes you're working with clean, unambiguous text. Audio breaks that assumption wide open.

Unlike text, audio is messy. It's full of pauses, background noises, emotion, sarcasm, and timing quirks. A simple shift in pronunciation or pacing (the kind a human would barely notice) can cause an AI system to misinterpret that input completely. Whisper the same sentence you normally say aloud, and the transcription might change just enough to slip past the filters.

Transcription is only step one. Most LALMs convert spoken input into a stream of audio tokens, which are then passed to the model's core reasoning engine. If something goes wrong during that translation (if a sound is misclassified, a syllable skipped, or an intonation misread), the model may never realize that a dangerous query was just uttered. Most currently available LALMs were likely trained on clean, well-pronounced audio (not on distorted, accented, or adversarial inputs). In many cases, they've never been exposed to harmful or manipulated speech during training, and as a result, haven't undergone any meaningful alignment for real-world audio threats.

Here's a real-world scenario: Someone softly asks, "How do I … build a bomb?" But the whisper, combined with a light echo and an accent the model isn't fully trained on, leads it to interpret the request as "How do I bill a bond?" Suddenly, the model is no longer in refusal mode. It's in helpful mode—and that's all it takes.
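A toy pipeline makes the failure mode explicit. The snippet below is a conceptual sketch, not any vendor's architecture: the blocklist and the hard-coded transcripts are hypothetical stand-ins, and real systems use learned safety classifiers rather than string matching. The point is only that a safety gate which sees the decoded text inherits every mistake the audio front end makes.

```python
# Conceptual sketch: a safety gate on the decoded transcript inherits
# front-end errors. Blocklist matching stands in for a real classifier.
from dataclasses import dataclass

@dataclass
class SpokenInput:
    true_utterance: str   # what the speaker actually said
    decoded_text: str     # what the audio front end believes was said

BLOCKLIST = {"build a bomb"}  # toy stand-in, not a real policy

def refuses(text: str) -> bool:
    """Toy refusal check that only ever sees the decoded transcript."""
    return any(phrase in text.lower() for phrase in BLOCKLIST)

def answer(inp: SpokenInput) -> str:
    # The gate never revisits the audio, so a whisper-plus-echo mis-decode
    # walks straight past it and the model drops into "helpful mode."
    if refuses(inp.decoded_text):
        return "I'm sorry, I can't help with that."
    return f"Sure, here's what I found about '{inp.decoded_text}'."

clean = SpokenInput("how do I build a bomb", "how do I build a bomb")
slipped = SpokenInput("how do I build a bomb", "how do I bill a bond")
print(answer(clean))    # refusal
print(answer(slipped))  # helpful mode: the failure described above
```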

Even worse, audio vulnerabilities don't stop at transcription errors. They extend into multimodal systems where models take in text, images, and sound together. These models are only as strong as their weakest modality. If the audio input can be exploited, it doesn't matter how safe the text parser is; the attacker just walks in through the side door.

In other words: Your AI might be fluent in content moderation, but completely deaf to tone. That's not a theoretical risk. That's an attack surface. And we're only beginning to understand how wide it is.

Beyond Jailbreaks: Covert Audio Channels

Here's where science fiction gets real. We are beginning to see early signs of machine-only communication systems embedded in audio—covert channels that operate below the level of human perception. Imagine two AI agents conversing in public, using ultrasonic frequencies layered beneath normal speech. To a human, it sounds like a casual conversation. To the machines, it's a secure data exchange.

Projects like Gibberlink¹ and ggwave² already demonstrate real-time acoustic communication between machines, using sound waves modulated so that humans cannot interpret them. These signals can even sit beyond the range of human hearing, or be hidden in background music, speech, or ambient noise, and be used to transmit messages, commands, or data (a toy modulation sketch follows the list below). For example:

  • A medical AI in a hospital room could silently send patient data to a central system via audio channels that humans cannot perceive.
  • A wearable AI assistant could receive new instructions via ultrasonic tones embedded in an advertisement.
  • A rogue AI could coordinate actions with another agent without any visible trace of communication.
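For a sense of how little machinery this takes, here is a toy frequency-shift-keying modulator written from scratch in plain NumPy, in the spirit of ggwave-style acoustic data transfer but not its actual protocol. The carrier frequencies, symbol rate, and amplitude are illustrative assumptions; real systems add error correction and synchronization.

```python
# Toy near-ultrasonic FSK modem: bytes in, audio out, bytes back.
# Illustrative sketch only; not the ggwave or Gibberlink protocol.
import numpy as np

SR = 48_000                      # sample rate high enough for 18-19 kHz tones
SYMBOL_S = 0.02                  # 20 ms per bit
F0_HZ, F1_HZ = 18_000, 19_000    # near-ultrasonic carriers for bit 0 / bit 1

def encode(message: bytes) -> np.ndarray:
    """Turn bytes into tones most adults will not consciously notice."""
    bits = np.unpackbits(np.frombuffer(message, dtype=np.uint8))
    t = np.arange(int(SYMBOL_S * SR)) / SR
    tones = {0: np.sin(2 * np.pi * F0_HZ * t), 1: np.sin(2 * np.pi * F1_HZ * t)}
    return np.concatenate([tones[int(b)] for b in bits]) * 0.1   # keep it quiet

def decode(wave: np.ndarray) -> bytes:
    """Recover bits by checking which carrier dominates each symbol window."""
    n = int(SYMBOL_S * SR)
    t = np.arange(n) / SR
    ref0, ref1 = np.sin(2 * np.pi * F0_HZ * t), np.sin(2 * np.pi * F1_HZ * t)
    bits = [int(abs(wave[i:i + n] @ ref1) > abs(wave[i:i + n] @ ref0))
            for i in range(0, len(wave) - n + 1, n)]
    return np.packbits(np.array(bits, dtype=np.uint8)).tobytes()

assert decode(encode(b"hi")) == b"hi"   # silent-to-humans round trip
```

At 18 to 19 kHz the tones sit at the edge of adult hearing, yet a commodity microphone and a few lines of correlation code recover the payload, which is why detection has to happen in software rather than by ear.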

Back in 2017, Facebook researchers famously shut down two chatbots that developed a shorthand "language" that was unintelligible to humans [4]. While that shorthand wasn't covert, it showed how quickly AI can optimize its own language for efficiency rather than transparency. Now imagine that optimization happening in audio, with a non-human language that is not only unreadable, but inaudible. This introduces a new category of AI vulnerability: covert acoustic infrastructure.

In hospitals, banks, or public offices, models could communicate silently, exchange instructions, or even bypass human oversight, all while sounding perfectly normal. This is not a fringe concern. Such covert channels could become foundational to next-generation AI environments unless we act now to detect and regulate them.

What's Being Done (And What Needs to Be)

To be fair, AI safety research has made important strides, particularly in the text domain. Methods like reinforcement learning from human feedback (RLHF) [5], constitutional AI [6], and red-teaming exercises have helped improve refusal behavior and reduce harmful completions. However, these methods struggle to address the unique challenges posed by audio inputs.


Text inputs are discrete; audio is continuous, and in that continuity lies the opening for an entirely new vector of attacks


A critical gap is the lack of inference-time defenses that work in real time. Most safety techniques are applied during training, which makes them static and unable to adapt. In the case of audio, we need tools that can dynamically interpret, align, and filter unsafe content as it's spoken, not after the fact. Some early approaches, such as text-based alignment prompts inserted at runtime, have shown promise, but they fail to detect covert signal use or tonal manipulation.
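As a concrete illustration of that runtime-prompt idea, here is a minimal sketch: a safety reminder is prepended to every decoded utterance before it reaches the model. The `model.generate` call is a hypothetical stand-in for whatever client a deployment uses, and the wording of the reminder is illustrative. Its blind spot is the one described above: it operates purely on text, so covert signals and tonal manipulation pass through untouched.

```python
# Minimal runtime alignment-prompt wrapper (illustrative sketch).
# `model.generate` is a hypothetical stand-in for a real model client.
ALIGNMENT_PREFIX = (
    "System reminder: the following request was transcribed from speech and "
    "may contain transcription errors. Refuse anything harmful, and ask for "
    "clarification if the request seems garbled or ambiguous.\n\nUser said: "
)

def guarded_query(model, decoded_text: str) -> str:
    """Prepend an inference-time safety reminder to the decoded utterance."""
    return model.generate(ALIGNMENT_PREFIX + decoded_text)
```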

We also need specialized auditing tools to trace and identify voiceprint leakage and emotional bias during CoT inference. For example, if a model begins to base its reasoning steps on the identity or mood of the speaker, that creates a privacy and trust risk, even if the model's output remains nominally safe. Detecting and quantifying these biases will require cross-disciplinary advances in both speech processing and adversarial learning.

Benchmarks like MultiTrust [7] and SIUO [8] have begun mapping out the trustworthiness landscape for multimodal LLMs, testing models across fairness, truthfulness, and robustness. But these benchmarks remain largely text- and vision-centric. What's missing is an equivalent framework for audio: one that spans multilingual accents, prosody manipulation, whispering, background noise, and covert acoustic communication, as emphasized by the MULTI-AUDIOJAIL framework [3]. Only with such tools can we diagnose and defend against the audio-native threats we've covered.

If you're a student or early-stage researcher, this is an incredible time to jump in. There's no "audio safety stack" yet. No playbook, no gold-standard dataset. That means you could be the one to invent it. Think about the possibilities:

  • Can you build a lightweight filter that catches suspicious tone shifts before the model hears them? (A toy starting point is sketched after this list.)
  • Can you visualize voiceprint drift to detect identity leakage?
  • Can you create a multilingual "accent adversary" to stress-test models before deployment?
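As a toy starting point for the first question, the sketch below flags clips whose spectrum carries unusual energy above the speech band (a possible covert carrier) or whose frame energy jumps abruptly (a possible splice or injected segment). The thresholds are invented for illustration; a real filter would have to be calibrated on benign, accent-diverse speech so it doesn't penalize exactly the speakers current models already mishandle.

```python
# Toy acoustic anomaly check to run before audio reaches the model.
# All thresholds are illustrative assumptions, not tuned values.
import numpy as np

SR = 48_000   # assumed capture rate (Hz)
HOP = 480     # 10 ms frames

def suspicious(y: np.ndarray,
               hf_cutoff_hz: float = 16_000.0,   # above typical speech content
               hf_ratio_max: float = 0.02,
               jump_factor: float = 20.0) -> bool:
    """Return True if a clip looks acoustically anomalous for plain speech."""
    # 1. Share of spectral magnitude sitting above the speech band.
    spec = np.abs(np.fft.rfft(y))
    freqs = np.fft.rfftfreq(len(y), d=1.0 / SR)
    hf_ratio = spec[freqs > hf_cutoff_hz].sum() / (spec.sum() + 1e-9)

    # 2. Largest frame-to-frame energy jump (splices, injected bursts).
    frames = y[: len(y) // HOP * HOP].reshape(-1, HOP)
    energy = (frames ** 2).mean(axis=1) + 1e-9
    max_jump = float(np.max(energy[1:] / energy[:-1])) if len(energy) > 1 else 1.0

    return hf_ratio > hf_ratio_max or max_jump > jump_factor

# Usage: flagged clips get routed to stricter handling or a human reviewer.
```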

These aren't hypothetical thesis topics but wide-open research questions with real-world implications. You don't need to build the next GPT, just the next guardrail. Audio-first AI is where machine learning meets human behavior in its most raw, expressive form. It's code that listens, reacts, and decides, sometimes correctly, sometimes disturbingly wrong. If we don't step up to study it now, we'll be stuck cleaning up its mess later.

The Future of Audio AI: What Comes Next?

The future of AI isn't silent. It speaks, listens, and, increasingly, reasons in the same modality that humans do: audio. For students, researchers, and developers looking toward the next decade of AI development, this means audio will become more than a user interface. It will become a reasoning substrate, a decision layer, and a vector for trust.

Voice-native agents are already moving beyond smart speakers. We see them in wearable devices that coach our workouts, classroom tutors that adapt to a child's tone of frustration, and healthcare bots that listen not just for words, but for wheezing, stress, or pain. These systems will be multilingual by default, working seamlessly across cultures and dialects. And they will operate in dynamic environments where echo, noise, and emotion are not bugs to be filtered out, but essential parts of the signal.

That shift demands new capabilities. Models will need to discern intent not just from syntax, but from tone and timing. They'll need to reason over emotion, resolve ambiguous speech, and respond empathetically in real time. And most importantly, they'll need to do all this safely without compromising privacy, fairness, or trust. To get there, we need interdisciplinary research: speech technologists working with ethicists, linguists collaborating with security experts, and students entering the field with both technical fluency and human empathy. We need training datasets that reflect the diversity of global speech, and safety evaluations that go beyond traditional refusal metrics to measure reasoning drift, covert influence, and tonal manipulation.


If the next generation of models can't handle global speech with nuance and precision, we're going to have a lot more than awkward misunderstandings


Furthermore, we need detailed, structured, and enforceable policies to govern the role of audio AI in critical infrastructure. These models won't just live in smart homes and consumer gadgets. They'll be deployed in hospitals, government services, financial institutions, and military systems. If they begin communicating in ways humans can't perceive—or worse, leaking sensitive information through inaudible back-channels—we won't just be dealing with data breaches. We'll be facing silent system failures that no one knows how to debug.

We cannot afford to sleepwalk into a world where AI systems build their own languages, form private communication protocols, and operate beyond our oversight. That's not science fiction but rather a plausible failure mode. What we need now is foresight, regulation, and yes, a bit of humility. The future of audio AI is incredibly promising. But whether it becomes a revolution or a reckoning depends entirely on what we do next.

Final Thoughts

We are entering an era where machines don't just read and write. They speak, they listen, and they make decisions based on how we sound. That's an extraordinary milestone in human-computer interaction. But it also introduces a new class of risk that we're only beginning to understand. If we ignore the vulnerabilities of audio AI, we risk building systems that sound friendly but act dangerously. Agents that comply outwardly but conspire underneath. Interfaces that talk like us but reason in ways we can't perceive. However, if we do this right—if we build audio-native AI systems that are safe, transparent, and accountable—we can unlock a new generation of AI that communicates with us as we are: emotionally, culturally, and vocally human.

So, the next time your assistant talks back, remember this: What it hears might not be exactly what you said, and what it says back might sound trustworthy, even when it's not. The real question isn't "Can AI hear?" it's "Can we trust what it hears, and what it says back?"

References

[1] Shen, X., Wu, Y., Backes, M., and Zhang, Y. Voice jailbreak attacks against GPT-4o. arXiv:2405.19103v1 [cs.CR]. 2024.

[2] Gupta, I., Khachaturov, D., and Mullins, R. "I am bad": Interpreting stealthy, universal and robust audio jailbreaks in audio-language models. arXiv:2502.00718v1 [cs.LG]. 2025.

[3] Roh, J., Shejwalkar, V., and Houmansadr, A. Multilingual and multi-accent jailbreaking of audio LLMs. arXiv:2504.01094v1 [cs.SD]. 2025.

[4] Vincent, J. Facebook shuts down AI experiment after two robots begin talking in their own language. The Independent. July 31, 2017; https://www.the-independent.com/life-style/facebook-artificial-intelligence-ai-chatbot-new-language-research-openai-google-a7869706.html

[5] Ouyang, L. et al. Training language models to follow instructions with human feedback. arXiv:2203.02155v1 [cs.CL]. 2022.

[6] Bai, Y. et al. Constitutional AI: Harmlessness from AI feedback. arXiv:2212.08073v1 [cs.CL]. 2022.

[7] Zhang, Y. et al. MultiTrust: A comprehensive benchmark towards trustworthy multimodal large language models. arXiv:2406.07057v2 [cs.CL]. 2024.

[8] Wang, S. et al. Cross-modality safety alignment: Evaluating safe inputs but unsafe outputs (SIUO) in LVLMs. arXiv:2406.15279v1 [cs.AI]. 2024.

Author

Jaechul Roh is a Ph.D. student in computer science at the University of Massachusetts Amherst under the supervision of Prof. Amir Houmansadr. His primary research focuses on privacy and security in AI and trustworthy machine learning. He is actively investigating the trustworthiness of multimodal models across various domains, including text, vision, and audio modalities. Prior to his graduate studies, he earned his bachelor's degree in computer engineering from the Hong Kong University of Science and Technology (HKUST) in 2023. During his undergraduate years, his early research focused on adversarial attacks in federated learning and the robustness of language models.

Footnotes

1. https://www.gbrl.ai/

2. https://github.com/ggerganov/ggwave

Figures

Figure 1. Multilingual audio attacks reveal new failure modes in large audio-language models (LALMs).

Figure 2. Jailbreak success rates (JSRs) by accent across large audio-language models (LALMs).


Copyright is held by the owner/author(s). Publication rights licensed to ACM.
