Naia
Table of Contents
  1. Video Manual
  2. Naia OS Live USB
  3. Installation
     3.1. Naia OS Installation (ISO)
     3.2. Linux App Installation
  4. Getting Started
  5. Main Screen
  6. Chat
  7. Conversation History
  8. Work Progress
  9. Skills
  10. Channels
  11. Agents
  12. Diagnostics
  13. Workspace
  14. Browser
  15. Panel Management
  16. Voice Chat
  17. Settings
  18. Tool Details
  19. Naia Account
  20. Troubleshooting
  21. Open Source Usage & Contribution

16. Voice Chat

Naia supports real-time voice conversations through Omni mode, in addition to text chat. Instead of the traditional STT (speech-to-text) → LLM (thinking) → TTS (text-to-speech) pipeline, Omni mode uses models that directly understand and respond with voice.

What is Omni Mode?

Traditional voice conversations go through three stages:

  1. STT: Convert user's speech to text
  2. LLM: Understand text and generate response
  3. TTS: Convert response text to speech

Omni mode processes all of this in a single model. It directly understands vocal tone, emotion, and nuance, responding with natural speech for faster, more natural conversations.
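The three-stage flow above can be sketched as a shell pipeline. This is a toy illustration with placeholder functions (none of these are the real models); it only shows how each stage's output feeds the next:

```shell
# Toy STT -> LLM -> TTS pipeline: each stage is a placeholder function,
# and data flows stage to stage through an ordinary shell pipe.
stt() { echo "what's the weather"; }            # speech in, text out
llm() { read -r q; echo "It is sunny. ($q)"; }  # text in, text out
tts() { read -r a; echo "[audio] $a"; }         # text in, speech out
stt | llm | tts
```

An omni model collapses these three hand-offs into a single model, which is why it can respond with lower latency and preserve tone and nuance that a text-only middle stage would discard.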

Gemini 2.5 Flash Live

Uses Google's Gemini 2.5 Flash Live model for real-time voice conversations.

Setup

  1. Go to Settings > AI and select Gemini or Naia Cloud as the provider
  2. Choose Gemini 2.5 Flash Live 🗣️ from the model list
  3. Direct connection requires a Google API key. Naia Cloud usage requires no separate key — just credits

Voice Selection

Choose from 8 voices:

Voice    Character
Kore     Female, calm (default)
Puck     Male, lively
Charon   Male
Aoede    Female
Fenrir   Male
Leda     Female
Orus     Male
Zephyr   Neutral

Features

  • Very low latency via real-time WebSocket connection
  • Full transcription provided for both user speech and AI responses
  • Approximately $0.03/min cost (direct connection)
  • Credits deducted when using Naia Cloud
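Since the direct-connection rate is quoted per minute, session cost scales linearly with talk time. A quick back-of-the-envelope estimate (the 30-minute figure below is just an example):

```shell
# Estimate direct-connection cost at ~$0.03/min (rate from the list above).
minutes=30
awk -v m="$minutes" 'BEGIN { printf "~$%.2f for %d min\n", m * 0.03, m }'
```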

vLLM + MiniCPM-o 4.5 (Local Omni)

Run omni voice conversations entirely locally. Uses OpenBMB's MiniCPM-o 4.5 model via a vLLM server.

Requirements

Mode                 VRAM    Notes
BF16 (recommended)   ~19GB   Full voice I/O support
INT4                 ~11GB   Text only, no voice output

Setup

  1. vLLM Server: Run the MiniCPM-o server on a machine with a GPU:

     cd naia-os/voice-server/minicpm-o
     bash setup.sh          # Download model + install dependencies
     python server.py       # Start server in BF16 mode

  2. Naia App: Go to Settings > AI, select vLLM as provider, and choose the MiniCPM-o 🗣️ model

  3. Server URL: Enter the server address (e.g., ws://localhost:8765/ws for local)

Remote GPU (RunPod, etc.)

If you don't have a local GPU, use cloud GPU services like RunPod:

# On RunPod instance
git clone https://github.com/nextain/naia-os.git
cd naia-os/voice-server/minicpm-o
bash setup.sh
python server.py --host 0.0.0.0

Set the server URL to wss://<pod-id>-8765.proxy.runpod.net in Naia settings.
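The proxy URL can be derived mechanically from the pod id. A minimal sketch, where the pod id is a made-up example and the `-8765` suffix follows the port/proxy pattern shown above:

```shell
# Build the RunPod proxy URL from a pod id (hypothetical id "abc123").
pod_id="abc123"
server_url="wss://${pod_id}-8765.proxy.runpod.net"
echo "$server_url"
```

Note the scheme change: the local server is plain ws://, while the RunPod proxy terminates TLS, so the app-side URL uses wss://.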

Known Limitations

  • Currently supports English and Chinese only (no Korean STT)
  • Half-duplex only — cannot speak and listen simultaneously
  • Voice output does not work in INT4 quantization mode

STT → LLM → TTS Pipeline Setup

If you're not using an omni model, you can configure STT and TTS individually for voice conversations.

STT (Speech Recognition)

Select an STT provider in Settings > Voice:

Provider        Offline      API Key                  Cost      Notes
Web Speech API  No           Not needed               Free      Browser built-in, availability varies by OS
Vosk            Yes          Not needed               Free      Real-time streaming, ~40-80MB model
Whisper         Yes          Not needed               Free      High accuracy, GPU accelerated
Naia Cloud      No           Not needed (Naia login)  Credits   Cloud STT
vLLM ASR        Yes (local)  Not needed               Free      Self-hosted server required

Offline providers (Vosk, Whisper) work without internet.

TTS (Speech Synthesis)

Select a TTS provider in Settings > Voice:

Provider          Offline      API Key                  Cost             Notes
Browser TTS       Yes          Not needed               Free             OS default voices
Edge TTS          No           Not needed               Free             Microsoft Edge voices
Naia Cloud TTS    No           Not needed (Naia login)  Credits          Google Chirp3 HD voices
Google Cloud TTS  No           Required                 $0.016/1K chars  Neural2/WaveNet voices
OpenAI TTS        No           Required                 $0.015/1K chars  OpenAI voices (Alloy + 12 more)
ElevenLabs        No           Required                 $0.30/1K chars   Premium synthesis (50+ voices)

Preview

After selecting a TTS provider, click the Preview button to test the selected voice.

Example Combinations

Various combinations are possible depending on your setup:

Environment        STT             LLM               TTS             Cost
Completely Free    Vosk            Ollama (local)    Browser TTS     Free
Budget-Friendly    Web Speech      Gemini 2.5 Flash  Edge TTS        ~$0.3/1M tokens
High Quality       Whisper         Claude Sonnet     ElevenLabs      API costs
Naia Cloud         Naia Cloud STT  Naia Cloud LLM    Naia Cloud TTS  Credits only
Fully Local (GPU)  vLLM ASR        vLLM              vLLM TTS        Free