
Voice Channel

Human supports real-time voice interaction through the built-in voice channel and the web dashboard, enabling hands-free conversations with your AI assistant.

The voice channel provides:

  • Speech-to-text (STT) — converts spoken audio to text for processing
  • Text-to-speech (TTS) — reads responses aloud via Cartesia streaming
  • Gemini Live — native end-to-end voice via Google’s Multimodal Live API (no STT/TTS pipeline)
  • WebSocket streaming — low-latency bidirectional audio
  • Manual VAD — client-managed voice activity detection for precise turn-taking
  • Affective dialog — emotion-aware responses that match user tone (Gemini Live)
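Manual VAD means the client, not the server, decides when a speech turn begins and ends. As an illustrative sketch only (not Human's actual implementation), a client-side detector could use a simple energy threshold over 16-bit PCM frames:

```python
import struct

def frame_rms(frame: bytes) -> float:
    """Root-mean-square energy of a 16-bit little-endian PCM frame."""
    samples = struct.unpack(f"<{len(frame) // 2}h", frame)
    if not samples:
        return 0.0
    return (sum(s * s for s in samples) / len(samples)) ** 0.5

def is_speech(frame: bytes, threshold: float = 500.0) -> bool:
    """Crude voice-activity check: high-energy frames count as speech."""
    return frame_rms(frame) > threshold
```

A production client would add smoothing and hangover time so brief pauses don't end the turn; the threshold value here is arbitrary.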
| Mode | How it works | Latency |
| --- | --- | --- |
| Standard (Cartesia) | Mic → STT → Agent → TTS → Speaker | Medium |
| Gemini Live | Mic → Gemini 3.1 Flash Live → Speaker | Low |
| Realtime | OpenAI Realtime API (full-duplex) | Low |
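The modes differ only in which stages sit between microphone and speaker. A hypothetical helper (names are illustrative, not Human's API; `"standard"` is an assumed mode value for the Cartesia path) makes the difference concrete:

```python
# Pipeline stages per voice mode, mirroring the table above.
PIPELINES = {
    "standard": ["mic", "stt", "agent", "tts", "speaker"],
    "gemini_live": ["mic", "gemini_live_model", "speaker"],
    "realtime": ["mic", "openai_realtime", "speaker"],
}

def pipeline_for(mode: str) -> list[str]:
    """Resolve the audio pipeline for a configured voice mode."""
    try:
        return PIPELINES[mode]
    except KeyError:
        raise ValueError(f"unknown voice mode: {mode!r}") from None
```

The latency difference follows directly: the standard mode crosses three provider boundaries per turn, while the live modes cross one.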

Voice settings live under the top-level voice key in config.json:

```json
{
  "voice": {
    "mode": "gemini_live",
    "realtime_model": "gemini-3.1-flash-live-preview",
    "realtime_voice": "Puck",
    "tts_provider": "cartesia",
    "tts_voice": "your-voice-id",
    "tts_model": "sonic-2",
    "stt_provider": "whisper",
    "stt_model": "whisper-1"
  }
}
```
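Which fields matter depends on the mode: the live modes read the `realtime_*` fields, while the standard pipeline reads `stt_*` and `tts_*`. A hedged pre-flight check could look like this (the per-mode requirements are an assumption inferred from the examples, not documented behavior):

```python
import json

# Keys each mode is assumed to need.
REQUIRED = {
    "gemini_live": ["realtime_model", "realtime_voice"],
    "realtime": ["realtime_model", "realtime_voice"],
    "standard": ["stt_provider", "stt_model", "tts_provider", "tts_voice"],
}

def missing_voice_keys(config_text: str) -> list[str]:
    """Return the voice-config keys that are absent or empty for the mode."""
    voice = json.loads(config_text).get("voice", {})
    mode = voice.get("mode", "standard")
    return [k for k in REQUIRED.get(mode, []) if not voice.get(k)]
```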

For production use with Vertex AI (recommended):

```json
{
  "voice": {
    "mode": "gemini_live",
    "realtime_model": "gemini-3.1-flash-live-preview",
    "realtime_voice": "Kore",
    "vertex_region": "us-central1",
    "vertex_project": "your-project-id",
    "vertex_access_token": ""
  }
}
```

When vertex_region and vertex_project are set, Human uses the Vertex AI endpoint with Application Default Credentials (ADC). Leave vertex_access_token empty to use ADC automatically, or provide a pre-fetched OAuth2 token.

| Field | Description | Default |
| --- | --- | --- |
| mode | Voice mode: gemini_live, realtime, sonata, webrtc | |
| realtime_model | Model for Gemini Live or OpenAI Realtime | gemini-3.1-flash-live-preview |
| realtime_voice | Voice name for the realtime model | Puck |
| tts_provider | TTS provider for standard mode | |
| tts_voice | Voice ID for Cartesia TTS | |
| tts_model | TTS model name | |
| stt_provider | STT provider for standard mode | |
| stt_model | STT model name | |
| vertex_region | Vertex AI region (enables Vertex endpoint) | |
| vertex_project | Vertex AI project ID | |
| vertex_access_token | Pre-fetched OAuth2 token (empty = use ADC) | |

Voice support is included by default. To explicitly enable:

```shell
cmake .. -DHU_ENABLE_VOICE=ON
```
| Provider | STT | TTS | Notes |
| --- | --- | --- | --- |
| Gemini Live | Native | Native | End-to-end multimodal, lowest latency |
| OpenAI Whisper | ✓ | | Best transcription accuracy |
| Cartesia | | ✓ | Streaming TTS with emotion controls |
| OpenAI TTS | | ✓ | Multiple voices |
| Deepgram | ✓ | | Low latency |

Audio data is streamed directly to the configured provider and is not stored locally. All connections use TLS. Gemini Live uses Vertex AI with ADC for production authentication — API keys are supported for development only.