Voice
Realtime voice conversations with transcription and playback.
Voice support is built into @robojs/ai. No separate plugin needed. Your bot can join Discord voice channels and have realtime spoken conversations with users, with optional transcript embeds posted to text channels.
Voice requires a few optional dependencies that are loaded lazily -- they're only needed when a voice session actually starts.
Dependencies
Install four packages alongside the plugin:
npm install @discordjs/voice prism-media opusscript ws
opusscript can be replaced with @discordjs/opus for better performance in production. ws is required for the OpenAI Realtime API WebSocket connection.
Setup
Add voice settings to your plugin config alongside an engine that supports realtime audio:
import { OpenAiEngine } from '@robojs/ai/engines/openai'
export default {
  engine: new OpenAiEngine({
    voice: {
      model: 'gpt-realtime',
      transcription: { language: 'en', model: 'gpt-4o-transcribe' }
    }
  }),
  voice: {
    playbackVoice: 'alloy',
    endpointing: 'server-vad',
    instructions: 'Keep spoken replies concise and under 10 seconds.',
    transcript: { enabled: true }
  }
}
The engine.voice block configures the realtime model and transcription. The top-level voice block controls Discord-side behavior like playback voice, endpointing strategy, and transcript output.
Voice configuration
Top-level options
Capture config
Nested under capture. Controls how user audio is processed before reaching the engine.
Playback config
Nested under playback. Controls how engine audio is sent to Discord.
Transcript config
Nested under transcript. Controls transcript embed output.
Per-guild overrides
Override any voice setting for specific guilds using the perGuild map:
export default {
  voice: {
    playbackVoice: 'alloy',
    perGuild: {
      '123456789012345678': {
        playbackVoice: 'verse',
        maxConcurrentChannels: 1
      }
    }
  }
}
Per-guild values are merged on top of the base voice config. Any option not specified in the guild override inherits the base value.
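The merge rule above can be pictured with a small sketch. It is not the plugin's own code: the `VoiceSettings` and `VoiceConfig` types are trimmed-down stand-ins, `resolveVoiceSettings` is a hypothetical helper, and the shallow spread is an assumption (the docs do not say whether nested blocks such as capture or transcript are merged deeply).

```typescript
// Illustrative types only: the real plugin config has more fields.
interface VoiceSettings {
  playbackVoice?: string
  maxConcurrentChannels?: number
  endpointing?: 'server-vad' | 'manual'
}

interface VoiceConfig extends VoiceSettings {
  perGuild?: Record<string, VoiceSettings>
}

// Hypothetical helper: layer a guild's override (if any) on top of the base values.
// Assumption: a shallow merge, so only top-level keys are replaced.
function resolveVoiceSettings(config: VoiceConfig, guildId: string): VoiceSettings {
  const { perGuild, ...base } = config
  const override = perGuild?.[guildId] ?? {}
  return { ...base, ...override }
}

const config: VoiceConfig = {
  playbackVoice: 'alloy',
  endpointing: 'server-vad',
  perGuild: {
    '123456789012345678': { playbackVoice: 'verse', maxConcurrentChannels: 1 }
  }
}

// Guild with an override: playbackVoice becomes 'verse', endpointing is inherited.
const overridden = resolveVoiceSettings(config, '123456789012345678')

// Guild without an override: every value comes from the base config.
const fallback = resolveVoiceSettings(config, '999999999999999999')
```

In this sketch the overridden guild keeps `endpointing: 'server-vad'` from the base config because its override never mentions that option.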
Endpointing strategies
The endpointing option determines how the system detects when a user has finished speaking.
Server VAD (recommended)
endpointing: 'server-vad'
Server-managed voice activity detection. Audio subscriptions remain active and the engine determines turn boundaries automatically. This produces the most natural conversation flow -- users speak freely and the bot responds when it detects a pause.
The capture.silenceDurationMs and capture.vadThreshold settings fine-tune sensitivity. Lower vadThreshold values pick up quieter speech but may trigger on background noise.
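To make the threshold trade-off concrete, here is a minimal sketch of an energy-based gate. It is an illustration, not the plugin's detector: `frameLevel` and `isSpeech` are hypothetical names, and it assumes the threshold is compared against a frame's RMS level on a 0 to 1 scale.

```typescript
// Root-mean-square level of a 16-bit PCM frame, normalized to the range 0..1.
function frameLevel(samples: Int16Array): number {
  if (samples.length === 0) return 0
  const sumSquares = samples.reduce((sum, sample) => {
    const normalized = sample / 32768
    return sum + normalized * normalized
  }, 0)
  return Math.sqrt(sumSquares / samples.length)
}

// Hypothetical gate: a frame counts as speech when its level reaches the threshold.
function isSpeech(samples: Int16Array, vadThreshold: number): boolean {
  return frameLevel(samples) >= vadThreshold
}

// Two synthetic frames: a quiet one (level about 0.01) and a loud one (about 0.3).
const quiet = new Int16Array(480).fill(328)
const loud = new Int16Array(480).fill(9830)

const quietAtDefault = isSpeech(quiet, 0.05) // false: below the threshold
const loudAtDefault = isSpeech(loud, 0.05) // true
const quietAtLow = isSpeech(quiet, 0.005) // true: a lower threshold lets quiet audio through
```

The last line shows the cost of a low threshold: the same quiet frame that was ignored at 0.05 now passes, which is how background noise ends up triggering a turn.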
Manual
endpointing: 'manual'
Explicit speech end markers. Requires a commit() call to signal that the user has finished speaking and trigger a response. Suited for push-to-talk style interactions where you want precise control over turn boundaries.
Programmatic control
The AI class exposes methods for managing voice sessions at runtime.
Joining and leaving
import { AI } from '@robojs/ai'
// Join a voice channel
await AI.startVoice({ guildId: guild.id, channelId: voiceChannel.id })
// Leave the voice channel
await AI.stopVoice({ guildId: guild.id })
Runtime config changes
Patch voice settings without restarting:
import { AI } from '@robojs/ai'
// Change the playback voice globally
await AI.setVoiceConfig({
  patch: { playbackVoice: 'echo' }
})
// Change settings for a specific guild
await AI.setVoiceConfig({
  guildId: '123456789012345678',
  patch: { maxConcurrentChannels: 1 }
})
Status and metrics
import { AI } from '@robojs/ai'
const status = await AI.getVoiceStatus()
const metrics = AI.getVoiceMetrics()
Voice events
Subscribe to voice lifecycle events for logging, analytics, or custom behavior:
import { AI } from '@robojs/ai'
export default () => {
  AI.onVoiceEvent('session:start', (status) => {
    // Voice session started
  })
  AI.onVoiceEvent('session:stop', (payload) => {
    // payload.reason contains the disconnect reason
  })
  AI.onVoiceEvent('config:change', (payload) => {
    // payload.config contains the updated voice config
  })
  AI.onVoiceEvent('transcript:segment', (payload) => {
    // payload.segment.text contains the transcribed text
  })
}
Unsubscribe with AI.offVoiceEvent() using the same event name and handler reference.
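The "same handler reference" requirement is easy to trip over with inline arrow functions. The sketch below uses a local stand-in for the AI export so it runs on its own. Only the onVoiceEvent and offVoiceEvent names come from this page; the stand-in's internals and the `emit` helper are assumptions made for illustration.

```typescript
type VoiceEventName = 'session:start' | 'session:stop' | 'config:change' | 'transcript:segment'
type Handler = (payload: unknown) => void

// Local stand-in for the real AI export (assumption: handlers are tracked by identity).
const handlers = new Map<VoiceEventName, Set<Handler>>()
const AI = {
  onVoiceEvent(name: VoiceEventName, handler: Handler): void {
    const registered = handlers.get(name) ?? new Set<Handler>()
    registered.add(handler)
    handlers.set(name, registered)
  },
  offVoiceEvent(name: VoiceEventName, handler: Handler): void {
    handlers.get(name)?.delete(handler)
  }
}

// Hypothetical helper that plays the role of the plugin firing an event.
function emit(name: VoiceEventName, payload: unknown): void {
  handlers.get(name)?.forEach((handler) => handler(payload))
}

// Keep the handler in a variable so the same reference can be passed to offVoiceEvent.
const received: unknown[] = []
const onStop: Handler = (payload) => {
  received.push(payload)
}

AI.onVoiceEvent('session:stop', onStop)
emit('session:stop', { reason: 'first' }) // delivered to onStop
AI.offVoiceEvent('session:stop', onStop)
emit('session:stop', { reason: 'second' }) // not delivered: onStop was removed
```

Passing a fresh arrow function to offVoiceEvent would leave the original subscription in place under this model, since the two functions are different objects.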
Audio pipeline
Understanding the audio pipeline helps with debugging and tuning.
Inbound (user to engine): Discord Opus (48kHz stereo) -> decode -> mono conversion -> downsample to 24kHz -> VAD threshold check -> engine
Outbound (engine to user): Engine output -> upsample to 48kHz -> Opus encode -> Discord playback
The playback pipeline rebuilds automatically if any component fails mid-stream. Barge-in is supported: when a user starts speaking, any in-progress assistant playback is interrupted.
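The sample-rate steps in both directions can be sketched on raw 16-bit PCM. This is a simplified illustration rather than the plugin's implementation: Opus decoding and encoding are left out, the function names are hypothetical, and pair averaging and linear interpolation stand in for the proper resampling filters a production pipeline would likely use. It also assumes the engine emits mono audio.

```typescript
// Average each adjacent pair of samples, halving the array length.
function averagePairs(input: Int16Array): Int16Array {
  const out = new Int16Array(Math.floor(input.length / 2))
  for (let i = 0; i < out.length; i++) {
    out[i] = Math.round((input[2 * i] + input[2 * i + 1]) / 2)
  }
  return out
}

// Inbound step 1: interleaved stereo (L, R, L, R, ...) to mono by averaging channels.
function stereoToMono(interleaved: Int16Array): Int16Array {
  return averagePairs(interleaved)
}

// Inbound step 2: 48 kHz to 24 kHz by averaging consecutive samples (crude low-pass plus decimation).
function downsampleByTwo(mono: Int16Array): Int16Array {
  return averagePairs(mono)
}

// Outbound: 24 kHz to 48 kHz by inserting a linearly interpolated sample after each input sample.
function upsampleByTwo(input: Int16Array): Int16Array {
  const out = new Int16Array(input.length * 2)
  for (let i = 0; i < input.length; i++) {
    const current = input[i]
    const next = i + 1 < input.length ? input[i + 1] : current
    out[2 * i] = current
    out[2 * i + 1] = Math.round((current + next) / 2)
  }
  return out
}

// Four stereo frames at 48 kHz, interleaved as left/right pairs.
const stereo48k = Int16Array.from([100, 200, 300, 500, -100, -300, 0, 0])
const mono48k = stereoToMono(stereo48k) // 150, 400, -200, 0
const mono24k = downsampleByTwo(mono48k) // 275, -100
const playback48k = upsampleByTwo(mono24k) // 275, 88, -100, -100
```

Each inbound step halves the amount of data, so eight interleaved stereo samples become two mono samples at 24 kHz before the VAD check.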
Reconnection
Voice sessions reconnect automatically with exponential backoff, up to 3 retries. Transcript tails are persisted via Flashcore so conversation context can be restored after a reconnect.
If all retries are exhausted, the session ends and a session:stop event fires with the failure reason.
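The retry behavior can be sketched as a loop with doubling delays. The 3-retry limit comes from this page; the 1000 ms base delay, the 30000 ms cap, and the names `backoffDelayMs` and `connectWithRetry` are assumptions for illustration, and the plugin's actual timings may differ. The sleep function is injected so the sketch does not depend on real timers.

```typescript
interface ReconnectResult {
  connected: boolean
  attempts: number
  reason?: string
}

type Sleep = (ms: number) => Promise<void>

// Delay before retry number `attempt` (0-based): base * 2^attempt, capped at maxDelayMs.
function backoffDelayMs(attempt: number, baseDelayMs: number, maxDelayMs: number): number {
  return Math.min(baseDelayMs * 2 ** attempt, maxDelayMs)
}

// One initial attempt plus up to `retries` retries. Returns the last failure reason
// when every attempt fails, mirroring the reason carried by a session:stop event.
async function connectWithRetry(
  connect: () => Promise<void>,
  sleep: Sleep,
  retries = 3,
  baseDelayMs = 1000,
  maxDelayMs = 30000
): Promise<ReconnectResult> {
  let reason = 'unknown'
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      await connect()
      return { connected: true, attempts: attempt + 1 }
    } catch (error) {
      reason = error instanceof Error ? error.message : String(error)
      if (attempt < retries) {
        await sleep(backoffDelayMs(attempt, baseDelayMs, maxDelayMs))
      }
    }
  }
  return { connected: false, attempts: retries + 1, reason }
}

// Delays used before the three retries under the assumed defaults.
const schedule = [0, 1, 2].map((attempt) => backoffDelayMs(attempt, 1000, 30000)) // [1000, 2000, 4000]
```

Under these assumed defaults a session that never recovers gives up after roughly seven seconds of waiting in total.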
Permissions
The bot requires these Discord permissions for voice to work:
| Permission | Purpose |
|---|---|
| CONNECT | Join voice channels |
| SPEAK | Transmit audio in voice channels |
| SEND_MESSAGES | Post transcript embeds to text channels |
SEND_MESSAGES is only required when transcript.enabled is true.
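A pre-flight check for these permissions might look like the sketch below. The bit positions are Discord's published permission flags. Everything else is an assumption for illustration: `missingVoicePermissions` is a hypothetical helper, and plain numbers are used only because these three flags fit in 32 bits (full Discord permission bitfields are larger, and discord.js represents them as bigint).

```typescript
// Discord permission flag values for the three permissions listed above.
const Permission = {
  SEND_MESSAGES: 1 << 11,
  CONNECT: 1 << 20,
  SPEAK: 1 << 21
} as const

type PermissionName = keyof typeof Permission

interface PermissionCheck {
  voiceChannel: number // the bot's permission bits in the target voice channel
  textChannel: number // the bot's permission bits in the transcript text channel
  transcriptEnabled: boolean // mirrors transcript.enabled in the voice config
}

// Hypothetical helper: list the permissions the bot still needs.
function missingVoicePermissions(check: PermissionCheck): PermissionName[] {
  const missing: PermissionName[] = []
  if ((check.voiceChannel & Permission.CONNECT) === 0) missing.push('CONNECT')
  if ((check.voiceChannel & Permission.SPEAK) === 0) missing.push('SPEAK')
  // SEND_MESSAGES only matters when transcript embeds are turned on.
  if (check.transcriptEnabled && (check.textChannel & Permission.SEND_MESSAGES) === 0) {
    missing.push('SEND_MESSAGES')
  }
  return missing
}

const voiceOnly = Permission.CONNECT | Permission.SPEAK

// Transcripts disabled: nothing is missing even without SEND_MESSAGES.
const withoutTranscript = missingVoicePermissions({
  voiceChannel: voiceOnly,
  textChannel: 0,
  transcriptEnabled: false
})

// Transcripts enabled: SEND_MESSAGES is reported as missing.
const withTranscript = missingVoicePermissions({
  voiceChannel: voiceOnly,
  textChannel: 0,
  transcriptEnabled: true
})
```

The voice and text channels are checked separately because a bot can hold different permissions in each.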
Voice metrics
AI.getVoiceMetrics() returns an object with runtime statistics.
Troubleshooting
No audio from the bot. Verify all four dependencies are installed (@discordjs/voice, prism-media, opusscript, ws). Check that the bot has CONNECT and SPEAK permissions in the target voice channel.
Audio quality issues. Adjust capture.vadThreshold -- lower values (closer to 0) are more sensitive and pick up quieter speech. Higher values filter out more background noise but may clip soft-spoken users.
Transcript embeds not appearing. Confirm transcript.enabled is set to true in your voice config. Verify the bot has SEND_MESSAGES permission in the target text channel.
High latency or delayed responses. Switch to server-vad endpointing if you're using manual. Server-side VAD has lower overhead and produces faster turn transitions.
Session disconnects frequently. Check your network stability. The bot retries up to 3 times with exponential backoff. The ws package handles the OpenAI Realtime API WebSocket connection, so if that connection is unstable, make sure ws is up to date.
