Naia supports real-time voice conversations through Omni mode, in addition to text chat. Instead of the traditional STT (speech-to-text) → LLM (thinking) → TTS (text-to-speech) pipeline, Omni mode uses models that directly understand and respond with voice.
## What is Omni Mode?
Traditional voice conversations go through three stages:
- STT: Convert user's speech to text
- LLM: Understand text and generate response
- TTS: Convert response text to speech
Omni mode processes all of this in a single model. It directly understands vocal tone, emotion, and nuance, responding with natural speech for faster, more natural conversations.
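The difference can be sketched in Python. The stage functions here are hypothetical stand-ins, not Naia's actual API; they only show the shape of the two approaches:

```python
# Hypothetical stand-ins so the sketch runs; real models would go here.
def stt(audio: bytes) -> str:
    return audio.decode()

def llm(text: str) -> str:
    return f"echo: {text}"

def tts(text: str) -> bytes:
    return text.encode()

def omni_model(audio: bytes) -> bytes:
    return b"echo: " + audio

def pipeline_reply(audio_in: bytes) -> bytes:
    # Three sequential stages, each adding latency; tone and
    # emotion are flattened away at the STT step.
    text = stt(audio_in)   # 1. STT: speech -> text
    reply = llm(text)      # 2. LLM: text -> text
    return tts(reply)      # 3. TTS: text -> speech

def omni_reply(audio_in: bytes) -> bytes:
    # One model, one hop: audio in, audio out.
    return omni_model(audio_in)
```

Both return audio, but the omni path makes a single model call instead of three, which is where the latency and naturalness gains come from.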
## Gemini 2.5 Flash Live

Uses Google's Gemini 2.5 Flash model for real-time voice conversations.

### Setup
1. Go to Settings > AI and select Gemini or Naia Cloud as the provider.
2. Choose Gemini 2.5 Flash Live 🗣️ from the model list.
3. A direct connection requires a Google API key; Naia Cloud requires no separate key, only credits.
### Voice Selection

Choose from 8 voices:
| Voice | Character |
|---|---|
| Kore | Female, calm (default) |
| Puck | Male, lively |
| Charon | Male |
| Aoede | Female |
| Fenrir | Male |
| Leda | Female |
| Orus | Male |
| Zephyr | Neutral |
### Features
- Very low latency via real-time WebSocket connection
- Full transcription provided for both user speech and AI responses
- Approximately $0.03/min cost (direct connection)
- Credits deducted when using Naia Cloud
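At the direct-connection rate above, session cost is easy to estimate. A minimal sketch; the rate constant comes from this page, the function name is illustrative:

```python
GEMINI_LIVE_USD_PER_MIN = 0.03  # direct-connection rate quoted above

def session_cost_usd(minutes: float) -> float:
    """Estimated direct-connection cost of a voice session."""
    return round(minutes * GEMINI_LIVE_USD_PER_MIN, 2)
```

For example, a 10-minute conversation comes to about $0.30, and an hour to about $1.80.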
## vLLM + MiniCPM-o 4.5 (Local Omni)
Run omni voice conversations entirely locally. Uses OpenBMB's MiniCPM-o 4.5 model via a vLLM server.
### Requirements
| Mode | VRAM | Notes |
|---|---|---|
| BF16 (recommended) | ~19GB | Full voice I/O support |
| INT4 | ~11GB | Text only, no voice output |
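The table above can be read as a simple capability check. The thresholds come from the table; the helper itself is hypothetical, not part of Naia:

```python
def pick_mode(vram_gb: float) -> str:
    """Choose a MiniCPM-o quantization mode from available VRAM
    (thresholds taken from the requirements table)."""
    if vram_gb >= 19:
        return "BF16"  # full voice I/O
    if vram_gb >= 11:
        return "INT4"  # text only, no voice output
    return "insufficient"
```

A 24GB card (e.g. RTX 4090-class) lands in BF16 territory; a 12GB card can run INT4 but loses voice output.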
### Setup

1. **vLLM Server**: Run the MiniCPM-o server on a machine with a GPU:

   ```bash
   cd naia-os/voice-server/minicpm-o
   bash setup.sh    # Download model + install dependencies
   python server.py # Start server in BF16 mode
   ```

2. **Naia App**: Go to Settings > AI, select vLLM as the provider, and choose the MiniCPM-o 🗣️ model.

3. **Server URL**: Enter the server address (e.g., `ws://localhost:8765/ws` for local).
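A quick way to catch typos before saving the server address is to validate the URL scheme. This helper is illustrative, not part of Naia:

```python
from urllib.parse import urlparse

def check_omni_url(url: str) -> bool:
    """Sanity-check a voice-server URL before saving it in settings:
    expects a ws:// (local) or wss:// (remote) scheme plus a host."""
    parts = urlparse(url)
    return parts.scheme in ("ws", "wss") and bool(parts.netloc)
```

`check_omni_url("ws://localhost:8765/ws")` passes, while an `http://` address (a common mistake) is rejected.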
### Remote GPU (RunPod, etc.)

If you don't have a local GPU, use a cloud GPU service such as RunPod:
```bash
# On RunPod instance
git clone https://github.com/nextain/naia-os.git
cd naia-os/voice-server/minicpm-o
bash setup.sh
python server.py --host 0.0.0.0
```
Set the server URL to `wss://<pod-id>-8765.proxy.runpod.net` in Naia settings.
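Since the proxy address follows a fixed pattern, it can be derived from the pod ID. A small sketch; the pattern is from the example above, the function name is illustrative:

```python
def runpod_ws_url(pod_id: str, port: int = 8765) -> str:
    """Build the RunPod proxy WebSocket URL for a pod
    (URL pattern taken from the example above)."""
    return f"wss://{pod_id}-{port}.proxy.runpod.net"
```

For a pod with ID `abc123`, this yields `wss://abc123-8765.proxy.runpod.net`.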
### Known Limitations
- Currently supports English and Chinese only (no Korean STT)
- Half-duplex only — cannot speak and listen simultaneously
- Voice output does not work in INT4 quantization mode
## STT → LLM → TTS Pipeline Setup
If you're not using an omni model, you can configure STT and TTS individually for voice conversations.
### STT (Speech Recognition)
Select an STT provider in Settings > Voice:
| Provider | Offline | API Key | Cost | Notes |
|---|---|---|---|---|
| Web Speech API | No | Not needed | Free | Browser built-in, availability varies by OS |
| Vosk | Yes | Not needed | Free | Real-time streaming, ~40-80MB model |
| Whisper | Yes | Not needed | Free | High accuracy, GPU accelerated |
| Naia Cloud | No | Not needed (Naia login) | Credits | Cloud STT |
| vLLM ASR | Yes (local) | Not needed | Free | Self-hosted server required |
Offline providers (Vosk, Whisper) work without internet.
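An app could use the table's offline column to filter providers by connectivity. This is a hypothetical selection sketch, not Naia's actual logic; the data mirrors the table above:

```python
# Offline capability per STT provider, as listed in the table above.
STT_OFFLINE = {
    "Web Speech API": False,
    "Vosk": True,
    "Whisper": True,
    "Naia Cloud": False,
    "vLLM ASR": True,
}

def usable_providers(online: bool) -> list[str]:
    """Providers that work under the current connectivity (illustrative)."""
    return [name for name, offline in STT_OFFLINE.items() if online or offline]
```

With no network, only Vosk, Whisper, and vLLM ASR remain available.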
### TTS (Speech Synthesis)
Select a TTS provider in Settings > Voice:
| Provider | Offline | API Key | Cost | Notes |
|---|---|---|---|---|
| Browser TTS | Yes | Not needed | Free | OS default voices |
| Edge TTS | No | Not needed | Free | Microsoft Edge voices |
| Naia Cloud TTS | No | Not needed (Naia login) | Credits | Google Chirp3 HD voices |
| Google Cloud TTS | No | Required | $0.016/1K chars | Neural2/WaveNet voices |
| OpenAI TTS | No | Required | $0.015/1K chars | OpenAI voices (Alloy + 12 more) |
| ElevenLabs | No | Required | $0.30/1K chars | Premium synthesis (50+ voices) |
| vLLM TTS | Yes (local) | Not needed | Free | Self-hosted server required |
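The paid providers' per-character rates above make cost comparison straightforward. The prices come from the table; the helper is illustrative:

```python
# Per-1K-character prices for the paid providers in the table above.
TTS_USD_PER_1K_CHARS = {
    "Google Cloud TTS": 0.016,
    "OpenAI TTS": 0.015,
    "ElevenLabs": 0.30,
}

def tts_cost_usd(provider: str, chars: int) -> float:
    """Estimated synthesis cost for a given character count."""
    return round(TTS_USD_PER_1K_CHARS[provider] * chars / 1000, 4)
```

Synthesizing 5,000 characters runs about $0.08 on Google Cloud TTS but $1.50 on ElevenLabs, a roughly 20x difference for premium quality.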
### Preview
After selecting a TTS provider, click the Preview button to test the selected voice.
## Example Combinations
Various combinations are possible depending on your setup:
| Environment | STT | LLM | TTS | Cost |
|---|---|---|---|---|
| Completely Free | Vosk | Ollama (local) | Browser TTS | Free |
| Budget-Friendly | Web Speech | Gemini 2.5 Flash | Edge TTS | ~$0.30/1M tokens |
| High Quality | Whisper | Claude Sonnet | ElevenLabs | API costs |
| Naia Cloud | Naia Cloud STT | Naia Cloud LLM | Naia Cloud TTS | Credits only |
| Fully Local (GPU) | vLLM ASR | vLLM | vLLM TTS | Free |