Naia supports real-time voice conversations through Omni mode, in addition to text chat. Instead of the traditional STT (speech-to-text) → LLM (thinking) → TTS (text-to-speech) pipeline, Omni mode uses models that directly understand and respond with voice.
## What is Omni Mode?
Traditional voice conversations go through three stages:
- STT: Convert user's speech to text
- LLM: Understand text and generate response
- TTS: Convert response text to speech
Omni mode processes all of this in a single model. It directly understands vocal tone, emotion, and nuance, responding with natural speech for faster, more natural conversations.
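The difference can be sketched in Python. The stage functions here are hypothetical stand-ins, not Naia's actual API; they only show the shape of the two approaches:

```python
# Hypothetical stand-ins so the sketch runs; real models would go here.
def stt(audio: bytes) -> str:
    return audio.decode()

def llm(text: str) -> str:
    return f"echo: {text}"

def tts(text: str) -> bytes:
    return text.encode()

def omni_model(audio: bytes) -> bytes:
    return b"echo: " + audio

def pipeline_reply(audio_in: bytes) -> bytes:
    # Three sequential stages, each adding latency; tone and
    # emotion are flattened away at the STT step.
    text = stt(audio_in)   # 1. STT: speech -> text
    reply = llm(text)      # 2. LLM: text -> text
    return tts(reply)      # 3. TTS: text -> speech

def omni_reply(audio_in: bytes) -> bytes:
    # One model, one hop: audio in, audio out.
    return omni_model(audio_in)
```

Both return audio, but the omni path makes a single model call instead of three, which is where the latency and naturalness gains come from.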
## Gemini 2.5 Flash Live

Uses Google's Gemini 2.5 Flash model for real-time voice conversations.

### Setup
1. Go to Settings > AI and select Gemini or Naia Cloud as the provider.
2. Choose Gemini 2.5 Flash Live 🗣️ from the model list.
3. A direct connection requires a Google API key; Naia Cloud requires no separate key, only credits.
### Voice Selection

Choose from 8 voices:
| Voice | Character |
|---|---|
| Kore | Female, calm (default) |
| Puck | Male, lively |
| Charon | Male |
| Aoede | Female |
| Fenrir | Male |
| Leda | Female |
| Orus | Male |
| Zephyr | Neutral |
### Features
- Very low latency via real-time WebSocket connection
- Full transcription provided for both user speech and AI responses
- Approximately $0.03/min cost (direct connection)
- Credits deducted when using Naia Cloud
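At the direct-connection rate above, session cost is easy to estimate. A minimal sketch; the rate constant comes from this page, the function name is illustrative:

```python
GEMINI_LIVE_USD_PER_MIN = 0.03  # direct-connection rate quoted above

def session_cost_usd(minutes: float) -> float:
    """Estimated direct-connection cost of a voice session."""
    return round(minutes * GEMINI_LIVE_USD_PER_MIN, 2)
```

For example, a 10-minute conversation comes to about $0.30, and an hour to about $1.80.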
## vLLM + MiniCPM-o 4.5 (Local Omni)
Run omni voice conversations entirely locally. Uses OpenBMB's MiniCPM-o 4.5 model via a vLLM server.
### Requirements
| Mode | VRAM | Notes |
|---|---|---|
| BF16 (recommended) | ~19GB | Full voice I/O support |
| INT4 | ~11GB | Text only, no voice output |
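The table above can be read as a simple capability check. The thresholds come from the table; the helper itself is hypothetical, not part of Naia:

```python
def pick_mode(vram_gb: float) -> str:
    """Choose a MiniCPM-o quantization mode from available VRAM
    (thresholds taken from the requirements table)."""
    if vram_gb >= 19:
        return "BF16"  # full voice I/O
    if vram_gb >= 11:
        return "INT4"  # text only, no voice output
    return "insufficient"
```

A 24GB card (e.g. RTX 4090-class) lands in BF16 territory; a 12GB card can run INT4 but loses voice output.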
### Setup

1. **vLLM Server**: Run the MiniCPM-o server on a machine with a GPU:

   ```bash
   cd naia-os/voice-server/minicpm-o
   bash setup.sh    # Download model + install dependencies
   python server.py # Start server in BF16 mode
   ```

2. **Naia App**: Go to Settings > AI, select vLLM as the provider, and choose the MiniCPM-o 🗣️ model.

3. **Server URL**: Enter the server address (e.g., `ws://localhost:8765/ws` for local).
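A quick way to catch typos before saving the server address is to validate the URL scheme. This helper is illustrative, not part of Naia:

```python
from urllib.parse import urlparse

def check_omni_url(url: str) -> bool:
    """Sanity-check a voice-server URL before saving it in settings:
    expects a ws:// (local) or wss:// (remote) scheme plus a host."""
    parts = urlparse(url)
    return parts.scheme in ("ws", "wss") and bool(parts.netloc)
```

`check_omni_url("ws://localhost:8765/ws")` passes, while an `http://` address (a common mistake) is rejected.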
### Remote GPU (RunPod, etc.)

If you don't have a local GPU, use a cloud GPU service such as RunPod:
```bash
# On RunPod instance
git clone https://github.com/nextain/naia-os.git
cd naia-os/voice-server/minicpm-o
bash setup.sh
python server.py --host 0.0.0.0
```
Set the server URL to `wss://<pod-id>-8765.proxy.runpod.net` in Naia settings.
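Since the proxy address follows a fixed pattern, it can be derived from the pod ID. A small sketch; the pattern is from the example above, the function name is illustrative:

```python
def runpod_ws_url(pod_id: str, port: int = 8765) -> str:
    """Build the RunPod proxy WebSocket URL for a pod
    (URL pattern taken from the example above)."""
    return f"wss://{pod_id}-{port}.proxy.runpod.net"
```

For a pod with ID `abc123`, this yields `wss://abc123-8765.proxy.runpod.net`.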
### Known Limitations
- Currently supports English and Chinese only (no Korean STT)
- Half-duplex only — cannot speak and listen simultaneously
- Voice output does not work in INT4 quantization mode
## STT → LLM → TTS Pipeline Setup
If you're not using an omni model, you can configure STT and TTS individually for voice conversations.
### STT (Speech Recognition)
Select an STT provider in Settings > Voice:
| Provider | Offline | API Key | Cost | Notes |
|---|---|---|---|---|
| Web Speech API | No | Not needed | Free | Browser built-in, availability varies by OS |
| Vosk | Yes | Not needed | Free | Real-time streaming, ~40-80MB model |
| Whisper | Yes | Not needed | Free | High accuracy, GPU accelerated |
| Naia Cloud | No | Not needed (Naia login) | Credits | Cloud STT |
| vLLM ASR | Yes (local) | Not needed | Free | Self-hosted server required |
Offline providers (Vosk, Whisper) work without internet.
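An app could use the table's offline column to filter providers by connectivity. This is a hypothetical selection sketch, not Naia's actual logic; the data mirrors the table above:

```python
# Offline capability per STT provider, as listed in the table above.
STT_OFFLINE = {
    "Web Speech API": False,
    "Vosk": True,
    "Whisper": True,
    "Naia Cloud": False,
    "vLLM ASR": True,
}

def usable_providers(online: bool) -> list[str]:
    """Providers that work under the current connectivity (illustrative)."""
    return [name for name, offline in STT_OFFLINE.items() if online or offline]
```

With no network, only Vosk, Whisper, and vLLM ASR remain available.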
### TTS (Speech Synthesis)
Select a TTS provider in Settings > Voice:
| Provider | Offline | API Key | Cost | Notes |
|---|---|---|---|---|
| Browser TTS | Yes | Not needed | Free | OS default voices |
| Edge TTS | No | Not needed | Free | Microsoft Edge voices |
| Naia Cloud TTS | No | Not needed (Naia login) | Credits | Google Chirp3 HD voices |
| Google Cloud TTS | No | Required | $0.016/1K chars | Neural2/WaveNet voices |
| OpenAI TTS | No | Required | $0.015/1K chars | OpenAI voices (Alloy + 12 more) |
| ElevenLabs | No | Required | $0.30/1K chars | Premium synthesis (50+ voices) |
| vLLM TTS | Yes (local) | Not needed | Free | Self-hosted server required |
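The paid providers' per-character rates above make cost comparison straightforward. The prices come from the table; the helper is illustrative:

```python
# Per-1K-character prices for the paid providers in the table above.
TTS_USD_PER_1K_CHARS = {
    "Google Cloud TTS": 0.016,
    "OpenAI TTS": 0.015,
    "ElevenLabs": 0.30,
}

def tts_cost_usd(provider: str, chars: int) -> float:
    """Estimated synthesis cost for a given character count."""
    return round(TTS_USD_PER_1K_CHARS[provider] * chars / 1000, 4)
```

Synthesizing 5,000 characters runs about $0.08 on Google Cloud TTS but $1.50 on ElevenLabs, a roughly 20x difference for premium quality.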
### Preview
After selecting a TTS provider, click the Preview button to test the selected voice.
## Example Combinations
Various combinations are possible depending on your setup:
| Environment | STT | LLM | TTS | Cost |
|---|---|---|---|---|
| Completely Free | Vosk | Ollama (local) | Browser TTS | Free |
| Budget-Friendly | Web Speech | Gemini 2.5 Flash | Edge TTS | ~$0.30/1M tokens |
| High Quality | Whisper | Claude Sonnet | ElevenLabs | API costs |
| Naia Cloud | Naia Cloud STT | Naia Cloud LLM | Naia Cloud TTS | Credits only |
| Fully Local (GPU) | vLLM ASR | vLLM | vLLM TTS | Free |