Naia

16. Voice Chat

Naia supports real-time voice conversations through Omni mode, in addition to text chat. Instead of the traditional STT (speech-to-text) → LLM (thinking) → TTS (text-to-speech) pipeline, Omni mode uses models that directly understand and respond with voice.

What is Omni Mode?

Traditional voice conversations go through three stages:

  1. STT: Convert user's speech to text
  2. LLM: Understand text and generate response
  3. TTS: Convert response text to speech

Omni mode handles all three stages in a single model. The model picks up vocal tone, emotion, and nuance directly from the audio and replies in speech, which makes conversations faster and more natural.
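
The difference between the two approaches can be sketched in a few lines of Python. Every function below is a hypothetical stub, not part of Naia or of any provider SDK. The sketch only shows that the cascaded pipeline passes text between three components, while an omni model maps audio to audio in one call.

```python
from dataclasses import dataclass


@dataclass
class Audio:
    """Placeholder for raw audio; a real system would hold PCM samples."""
    description: str


# Hypothetical stubs standing in for real STT / LLM / TTS services.
def stt(audio: Audio) -> str:
    return f"text({audio.description})"


def llm(prompt: str) -> str:
    return f"reply-to[{prompt}]"


def tts(text: str) -> Audio:
    return Audio(f"speech({text})")


def cascaded_turn(user_audio: Audio) -> Audio:
    """Traditional pipeline: three hops, and only transcribed text reaches the LLM."""
    return tts(llm(stt(user_audio)))


def omni_turn(user_audio: Audio) -> Audio:
    """Omni mode: one model consumes audio and produces audio (stubbed here)."""
    return Audio(f"speech(reply-to-audio[{user_audio.description}])")


print(cascaded_turn(Audio("hello")).description)
print(omni_turn(Audio("hello")).description)
```

In the cascaded version each stage adds its own latency, and anything the transcript does not capture never reaches the LLM.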

Gemini 2.5 Flash Live

Uses Google's Gemini 2.5 Flash model for real-time voice conversations.

Setup

  1. Go to Settings > AI and select Gemini or Naia Cloud as the provider
  2. Choose Gemini 2.5 Flash Live 🗣️ from the model list
  3. For a direct Gemini connection, enter your Google API key. Naia Cloud needs no separate key, only credits

Voice Selection

Choose from 8 voices:

| Voice | Character |
|---|---|
| Kore | Female, calm (default) |
| Puck | Male, lively |
| Charon | Male |
| Aoede | Female |
| Fenrir | Male |
| Leda | Female |
| Orus | Male |
| Zephyr | Neutral |

Features

  • Very low latency via real-time WebSocket connection
  • Full transcription provided for both user speech and AI responses
  • Costs approximately $0.03 per minute over a direct connection
  • Credits deducted when using Naia Cloud
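
As a rough budgeting aid, the per-minute figure above can be turned into a session estimate. This is a minimal sketch that assumes a flat rate of $0.03 per minute. Actual billing depends on Google's pricing at the time, and Naia Cloud credit rates are not covered here.

```python
# Assumption: flat rate taken from the approximate $0.03/min figure above.
GEMINI_LIVE_USD_PER_MIN = 0.03


def estimate_live_cost(minutes: float,
                       usd_per_min: float = GEMINI_LIVE_USD_PER_MIN) -> float:
    """Return the estimated cost in USD, rounded to cents."""
    if minutes < 0:
        raise ValueError("minutes must be non-negative")
    return round(minutes * usd_per_min, 2)


print(estimate_live_cost(20))  # a 20-minute conversation
print(estimate_live_cost(60))  # a one-hour conversation
```

At this rate a 20-minute conversation comes to about $0.60 and an hour to about $1.80.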

vLLM + MiniCPM-o 4.5 (Local Omni)

Run omni voice conversations entirely locally. Uses OpenBMB's MiniCPM-o 4.5 model via a vLLM server.

Requirements

| Mode | VRAM | Notes |
|---|---|---|
| BF16 (recommended) | ~19GB | Full voice I/O support |
| INT4 | ~11GB | Text only, no voice output |
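
The table can be read as a simple decision rule. The helper below is a hypothetical sketch, not part of the Naia voice server. It treats the approximate VRAM figures as hard thresholds, which is an assumption, so leave some headroom in practice.

```python
from typing import Optional

# Approximate VRAM requirements from the table above, in GB,
# ordered from most to least capable mode.
MODES = [("BF16", 19.0), ("INT4", 11.0)]


def pick_mode(free_vram_gb: float) -> Optional[str]:
    """Return the most capable mode that fits, or None if neither fits."""
    for mode, needed_gb in MODES:
        if free_vram_gb >= needed_gb:
            return mode
    return None


print(pick_mode(24))  # a 24GB card fits BF16
print(pick_mode(12))  # a 12GB card only fits INT4 (text only, no voice output)
print(pick_mode(8))   # too small for either mode
```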

Setup

  1. vLLM Server: Run the MiniCPM-o server on a machine with a GPU:

         cd naia-os/voice-server/minicpm-o
         bash setup.sh          # Download model + install dependencies
         python server.py       # Start server in BF16 mode

  2. Naia App: Go to Settings > AI, select vLLM as the provider, and choose the MiniCPM-o 🗣️ model
  3. Server URL: Enter the server address (e.g., ws://localhost:8765/ws for local)
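
A common mistake is pasting an http:// address where a WebSocket URL is expected. The sketch below is a hypothetical helper, not something Naia ships. It uses only the standard library to check a server URL before you enter it.

```python
from urllib.parse import urlparse


def check_server_url(url: str) -> str:
    """Return the URL unchanged if it looks like a WebSocket endpoint."""
    parts = urlparse(url)
    if parts.scheme not in ("ws", "wss"):
        raise ValueError(f"expected ws:// or wss://, got scheme {parts.scheme!r}")
    if not parts.hostname:
        raise ValueError("URL has no host")
    return url


print(check_server_url("ws://localhost:8765/ws"))
```

The check is purely syntactic. It does not confirm that a server is actually listening at that address.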

Remote GPU (RunPod, etc.)

If you don't have a local GPU, use cloud GPU services like RunPod:

    # On RunPod instance
    git clone https://github.com/nextain/naia-os.git
    cd naia-os/voice-server/minicpm-o
    bash setup.sh
    python server.py --host 0.0.0.0

Set the server URL to wss://<pod-id>-8765.proxy.runpod.net in Naia settings.
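
The URL pattern above can be filled in programmatically. This is a small sketch with a hypothetical helper name. It assumes pod IDs are plain alphanumeric strings and defaults to port 8765 from the example. The manual gives the RunPod URL without a path, so none is appended.

```python
import re


def runpod_ws_url(pod_id: str, port: int = 8765) -> str:
    """Build the proxied WebSocket URL for a RunPod instance."""
    # Assumption: pod IDs contain only letters and digits.
    if not re.fullmatch(r"[A-Za-z0-9]+", pod_id):
        raise ValueError("pod_id should be alphanumeric")
    return f"wss://{pod_id}-{port}.proxy.runpod.net"


# "abc123xyz" is a made-up pod ID used purely for illustration.
print(runpod_ws_url("abc123xyz"))
```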

Known Limitations

  • Currently supports English and Chinese only (no Korean STT)
  • Half-duplex only — cannot speak and listen simultaneously
  • Voice output does not work in INT4 quantization mode

STT → LLM → TTS Pipeline Setup

If you're not using an omni model, you can configure STT and TTS individually for voice conversations.

STT (Speech Recognition)

Select an STT provider in Settings > Voice:

| Provider | Offline | API Key | Cost | Notes |
|---|---|---|---|---|
| Web Speech API | No | Not needed | Free | Browser built-in, availability varies by OS |
| Vosk | Yes | Not needed | Free | Real-time streaming, ~40-80MB model |
| Whisper | Yes | Not needed | Free | High accuracy, GPU accelerated |
| Naia Cloud | No | Not needed (Naia login) | Credits | Cloud STT |
| vLLM ASR | Yes (local) | Not needed | Free | Self-hosted server required |

TTS (Speech Synthesis)

Select a TTS provider in Settings > Voice:

| Provider | Offline | API Key | Cost | Notes |
|---|---|---|---|---|
| Naia Cloud TTS | No | Not needed (Naia login) | Credits | Google Chirp3 HD voices |
| Edge TTS | No | Not needed | Free | Microsoft Edge voices |
| Google Cloud TTS | No | Required | $0.016/1K chars | Neural2/WaveNet voices |
| OpenAI TTS | No | Required | $0.015/1K chars | OpenAI voices (Alloy + 12 more) |
| ElevenLabs | No | Required | $0.30/1K chars | Premium synthesis (50+ voices) |
| vLLM TTS | Yes (local) | Not needed | Free | Self-hosted server required |
| Custom | Varies | Varies | Varies | User-defined endpoint |
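
To compare the paid providers, the per-1K-character rates can be applied to a piece of text. The sketch below copies the rates from the table and assumes that Python's `len()` approximates the billed character count. Providers may count characters differently, so treat the result as an estimate.

```python
# USD per 1,000 characters, copied from the table above (paid providers only).
TTS_USD_PER_1K_CHARS = {
    "Google Cloud TTS": 0.016,
    "OpenAI TTS": 0.015,
    "ElevenLabs": 0.30,
}


def tts_cost(provider: str, text: str) -> float:
    """Estimate the synthesis cost in USD for one piece of text."""
    rate = TTS_USD_PER_1K_CHARS[provider]  # KeyError for unlisted providers
    return len(text) / 1000 * rate


reply = "Sure, I can help with that. " * 50  # a sample response of 1,400 characters
for name in TTS_USD_PER_1K_CHARS:
    print(f"{name}: ${tts_cost(name, reply):.4f}")
```

At these rates ElevenLabs costs roughly twenty times as much per character as Google Cloud TTS or OpenAI TTS.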

Preview

After selecting a TTS provider, click the Preview button to test the selected voice.

Example Combinations

Mix and match providers to suit your budget and hardware:

| Environment | STT | LLM | TTS | Cost |
|---|---|---|---|---|
| Completely Free | Vosk | Ollama (local) | Edge TTS | Free |
| Budget-Friendly | Web Speech | Gemini 2.5 Flash | Edge TTS | ~$0.3/1M tokens |
| High Quality | Whisper | Claude Sonnet | ElevenLabs | API costs |
| Naia Cloud | Naia Cloud STT | Naia Cloud LLM | Naia Cloud TTS | Credits only |
| Fully Local (GPU) | vLLM ASR | vLLM | vLLM TTS | Free |
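
Free and offline are not the same thing, which is easy to miss when reading the tables. The sketch below encodes the combinations as data and checks which of them can run without an internet connection. The class and set names are hypothetical. Offline capability for STT and TTS follows the provider tables above, and treating Ollama and vLLM as running on your own machine is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VoiceStack:
    stt: str
    llm: str
    tts: str


# Offline capability as listed in the STT and TTS provider tables.
OFFLINE_STT = {"Vosk", "Whisper", "vLLM ASR"}
OFFLINE_TTS = {"vLLM TTS"}
# Assumption: these LLM backends run locally rather than on a remote GPU.
LOCAL_LLM = {"Ollama (local)", "vLLM"}

PRESETS = {
    "Completely Free": VoiceStack("Vosk", "Ollama (local)", "Edge TTS"),
    "Budget-Friendly": VoiceStack("Web Speech", "Gemini 2.5 Flash", "Edge TTS"),
    "High Quality": VoiceStack("Whisper", "Claude Sonnet", "ElevenLabs"),
    "Naia Cloud": VoiceStack("Naia Cloud STT", "Naia Cloud LLM", "Naia Cloud TTS"),
    "Fully Local (GPU)": VoiceStack("vLLM ASR", "vLLM", "vLLM TTS"),
}


def works_offline(stack: VoiceStack) -> bool:
    """True only if every stage of the pipeline can run without a network."""
    return (stack.stt in OFFLINE_STT
            and stack.llm in LOCAL_LLM
            and stack.tts in OFFLINE_TTS)


offline = [name for name, stack in PRESETS.items() if works_offline(stack)]
print(offline)
```

Under these assumptions only the Fully Local (GPU) combination qualifies. The Completely Free combination still needs a connection because Edge TTS is listed as not offline.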