Naia

4.5. Naia Model — Developer Guide

This guide covers using the Naia model from code. After starting the model as described in 4.4 Naia Model Download, use the OpenAI-compatible API it serves locally as-is; there is no gateway and no queue. With any OpenAI SDK or tool, the only change is to point the base URL at this model.

This is not limited to naia-os/shell: any code that speaks OpenAI Realtime/Chat/Audio/Embeddings connects as-is, and you can build and run new applications on top of this model.

1. Connect · auth

  • REST base: http://<host>:8892/v1 (127.0.0.1 on the same PC)
  • Realtime (WS): ws://<host>:8892/v1/realtime (a bare ws://<host>:8892 also works — path /v1/realtime + default model auto-applied)
  • Connect: on local (127.0.0.1) or Tailscale, no auth is required, because the container self-verifies its license. Clients that need a key field (OpenAI SDK etc.) can pass any value (e.g. naia). When exposing the API remotely, put Tailscale/VPN (§4.4) in front.
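
The addresses above can be derived from a host name with a small helper. This is a minimal sketch, not part of the Naia tooling: the function names (rest_base, realtime_url, auth_headers) are hypothetical, and only the port, paths, and placeholder key come from this guide.

```python
DEFAULT_PORT = 8892  # port used throughout this guide


def rest_base(host: str = "127.0.0.1", port: int = DEFAULT_PORT) -> str:
    """Base URL to hand to an OpenAI-compatible client as its base URL."""
    return f"http://{host}:{port}/v1"


def realtime_url(host: str = "127.0.0.1", port: int = DEFAULT_PORT) -> str:
    """WebSocket URL of the realtime voice endpoint."""
    return f"ws://{host}:{port}/v1/realtime"


def auth_headers(api_key: str = "naia") -> dict:
    """Placeholder bearer header; per this guide the local container ignores the value."""
    return {"Authorization": f"Bearer {api_key}"}


if __name__ == "__main__":
    print(rest_base())                 # same PC
    print(realtime_url("my-tailnet-host"))  # hypothetical Tailscale host name
```

Keeping the host as the only variable makes it easy to switch between the same PC and a Tailscale/VPN address without touching the rest of the client.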

🔑 One key — the subscription key

  • Subscription key: the key you get from the portal. It is used only at container run time for activation (-e NAIA_ACCOUNT_TOKEN=<subscription-key>), where it checks the subscription and obtains a time-bound license (certificate).
  • There is no separate connection key. Once activated, the container self-verifies locally with the certificate, so clients (naia-os, OpenAI SDK) just need to connect by URL: 127.0.0.1 on the same PC, or Tailscale/VPN (§4.4) from another device. The container does not call the gateway per connection.
  • The api_key in the examples below is a placeholder (OpenAI SDK requires the field) — the offline container doesn't check it, so any value like "naia" works.

2. Endpoints (OpenAI-compatible)

Endpoint | Use | Backend
GET /health | Readiness {"ready":true,"services":{tts,stt,llm},"vad":true} (no auth) | -
GET /v1/models | Model list | -
WS /v1/realtime | Realtime voice session (VAD, barge-in, emotion) | cascade
POST /v1/chat/completions | Chat (streaming) | gemma4-e4b
POST /v1/audio/speech | Text-to-speech (TTS) | VoxCPM2
POST /v1/audio/transcriptions | Speech-to-text (STT) | Whisper
POST /v1/embeddings | Embeddings | bge-m3
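
Before sending requests it can help to wait for GET /health to report readiness. The sketch below uses only the Python standard library and relies solely on the top-level "ready" field shown in the table. The helper names, the retry count, and the delay are assumptions for illustration.

```python
import json
import time
import urllib.request


def is_ready(payload: dict) -> bool:
    """True only when the health payload reports "ready": true."""
    return payload.get("ready") is True


def fetch_health(base: str = "http://127.0.0.1:8892", timeout: float = 3.0) -> dict:
    """GET /health and decode the JSON body (no auth header needed)."""
    with urllib.request.urlopen(f"{base}/health", timeout=timeout) as resp:
        return json.load(resp)


def wait_until_ready(fetch=fetch_health, attempts: int = 30, delay: float = 1.0) -> bool:
    """Poll the health endpoint until it is ready or the attempts run out."""
    for _ in range(attempts):
        try:
            if is_ready(fetch()):
                return True
        except OSError:
            pass  # connection refused or timed out: container not listening yet
        time.sleep(delay)
    return False
```

Passing the fetch function as a parameter keeps the polling logic testable without a running container.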

Chat (curl):

curl -s http://127.0.0.1:8892/v1/chat/completions \
  -H "Authorization: Bearer naia" -H "Content-Type: application/json" \
  -d '{"model":"naia-0.9-omni-24g","messages":[{"role":"user","content":"hi"}],"stream":false}'

OpenAI SDK (Python), changing only base_url:

from openai import OpenAI
client = OpenAI(base_url="http://127.0.0.1:8892/v1", api_key="naia")
print(client.chat.completions.create(
    model="naia-0.9-omni-24g",
    messages=[{"role": "user", "content": "hi"}],
).choices[0].message.content)
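
The examples above request a single non-streamed answer. With "stream": true, OpenAI-compatible servers send Server-Sent Events lines of the form data: {json}, ending with data: [DONE]. Assuming this server follows that format (this guide only says the chat endpoint streams), the response lines could be reduced to text like this. The function name iter_stream_text is hypothetical.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield the text fragments carried by chat-completion stream lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel used by the OpenAI streaming format
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


if __name__ == "__main__":
    sample = [
        'data: {"choices":[{"delta":{"content":"Hel"}}]}',
        "",
        'data: {"choices":[{"delta":{"content":"lo"}}]}',
        "data: [DONE]",
    ]
    print("".join(iter_stream_text(sample)))
```

When using the OpenAI SDK, passing stream=True to chat.completions.create does this parsing for you.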

Transcription (STT):

curl -s http://127.0.0.1:8892/v1/audio/transcriptions \
  -H "Authorization: Bearer naia" \
  -F file=@sample.wav -F model=naia-0.9-omni-24g
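
The embeddings endpoint from the table can be called in the same style. The sketch below builds the request with the standard library. It assumes the OpenAI embeddings request and response shapes ("input" in the body, "data"[i]["embedding"] in the reply) and that the same model id is accepted on this endpoint; neither is confirmed by this guide.

```python
import json
import urllib.request


def embeddings_request(texts, base="http://127.0.0.1:8892/v1",
                       model="naia-0.9-omni-24g", api_key="naia"):
    """Build (but do not send) a POST /v1/embeddings request."""
    body = json.dumps({"model": model, "input": texts}).encode("utf-8")
    return urllib.request.Request(
        f"{base}/embeddings",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",  # placeholder value
            "Content-Type": "application/json",
        },
    )


def embed(texts, **kwargs):
    """Send the request and return one vector per input text."""
    with urllib.request.urlopen(embeddings_request(texts, **kwargs)) as resp:
        payload = json.load(resp)
    # Assumes the OpenAI response layout: {"data": [{"embedding": [...]}, ...]}
    return [item["embedding"] for item in payload["data"]]
```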

3. Realtime voice — connection flow (WS)

This is the same flow the 4.3 live demo uses. (The offline container starts immediately, with no gateway queue or assignment.)

  1. Connect — open ws://<host>:8892.

  2. First frame (auth · language): browser WebSockets can't send custom headers, so send the key and locale as the first message:

    { "setup": { "apiKey": "naia", "locale": "en" } }
    
  3. When the server sends session.created, configure the session with session.update:

    {
      "type": "session.update",
      "session": {
        "modalities": ["text", "audio"],
        "input_audio_format": "pcm16",
        "output_audio_format": "pcm16",
        "instructions": "<persona instructions>",
        "turn_detection": { "type": "server_vad" },
        "input_audio_transcription": { "language": "en" },
        "ref_audio_url": "<URL of a voice sample to mimic (optional)>"
      }
    }
    
  4. Exchange

    Client → Server
    Voice input | {"type":"input_audio_buffer.append","audio":"<base64 PCM16 24kHz>"} (server VAD detects end of speech)
    Text input | conversation.item.create, then response.create
    Barge-in | response.cancel

    Server → Client
    response.audio.delta | base64 PCM16 24kHz audio chunk
    response.audio_transcript.delta / response.text.delta | answer text (streaming)
    conversation.item.input_audio_transcription.completed | transcript of your speech
    emotion.updated | emotion / prosody tag (§5)
    response.done | end of one turn
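
The four steps above can be sketched as plain message builders plus a small collector for one turn. This is an illustration under stated assumptions, not the reference client. The helper names are hypothetical, mono audio is assumed, and the "delta" and "transcript" field names follow the OpenAI Realtime event schema, which this guide does not spell out. Sending the frames over an actual WebSocket is left to the library of your choice.

```python
import base64
import json

SAMPLE_RATE = 24_000   # PCM16 at 24 kHz, per the exchange table
BYTES_PER_SAMPLE = 2   # 16-bit samples; mono is an assumption


def setup_frame(locale=None, api_key="naia"):
    """Step 2: first frame. Omitting locale keeps the auto language mode."""
    setup = {"apiKey": api_key}
    if locale:
        setup["locale"] = locale
    return json.dumps({"setup": setup})


def session_update(instructions, language=None, ref_audio_url=None):
    """Step 3: session.update sent after session.created arrives."""
    session = {
        "modalities": ["text", "audio"],
        "input_audio_format": "pcm16",
        "output_audio_format": "pcm16",
        "instructions": instructions,
        "turn_detection": {"type": "server_vad"},
    }
    if language:
        session["input_audio_transcription"] = {"language": language}
    if ref_audio_url:
        session["ref_audio_url"] = ref_audio_url
    return json.dumps({"type": "session.update", "session": session})


def audio_frames(pcm: bytes, chunk_ms: int = 100):
    """Step 4: split raw PCM16 into input_audio_buffer.append frames."""
    step = SAMPLE_RATE * BYTES_PER_SAMPLE * chunk_ms // 1000
    for start in range(0, len(pcm), step):
        chunk = base64.b64encode(pcm[start:start + step]).decode("ascii")
        yield json.dumps({"type": "input_audio_buffer.append", "audio": chunk})


class TurnCollector:
    """Accumulates the server events of one turn."""

    def __init__(self):
        self.audio = bytearray()     # decoded PCM16 of the answer
        self.text = []               # streamed answer text fragments
        self.user_transcript = None  # transcript of the user's speech
        self.done = False

    def handle(self, raw: str) -> None:
        event = json.loads(raw)
        kind = event.get("type")
        if kind == "response.audio.delta":
            self.audio += base64.b64decode(event["delta"])
        elif kind in ("response.audio_transcript.delta", "response.text.delta"):
            # If a server sends both event types for one answer, keep only one.
            self.text.append(event["delta"])
        elif kind == "conversation.item.input_audio_transcription.completed":
            self.user_transcript = event.get("transcript")
        elif kind == "response.done":
            self.done = True
```

For barge-in, a client would send {"type": "response.cancel"} and discard any audio still queued for playback.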

4. Languages — 30 languages (default = auto/global)

The model supports 30 languages (Arabic, Burmese, Chinese, Danish, Dutch, English, Finnish, French, German, Greek, Hebrew, Hindi, Indonesian, Italian, Japanese, Khmer, Korean, Lao, Malay, Norwegian, Polish, Portuguese, Russian, Spanish, Swahili, Swedish, Tagalog, Thai, Turkish, Vietnamese).

  • Default (unset) = global/auto — it detects the language you spoke and replies in that language (per turn).
  • To pin a specific language, give an ISO-639-1 code (e.g. ko/en/ja) in setup.locale or in input_audio_transcription.language of session.update.
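
A client might validate the language code before pinning it. The table below pairs the 30 listed languages with their standard ISO-639-1 codes. Whether the server accepts exactly these spellings (for example no for Norwegian) is an assumption, and pin_language is a hypothetical helper.

```python
from typing import Optional

# ISO-639-1 codes for the 30 languages listed above.
SUPPORTED = {
    "ar": "Arabic", "my": "Burmese", "zh": "Chinese", "da": "Danish",
    "nl": "Dutch", "en": "English", "fi": "Finnish", "fr": "French",
    "de": "German", "el": "Greek", "he": "Hebrew", "hi": "Hindi",
    "id": "Indonesian", "it": "Italian", "ja": "Japanese", "km": "Khmer",
    "ko": "Korean", "lo": "Lao", "ms": "Malay", "no": "Norwegian",
    "pl": "Polish", "pt": "Portuguese", "ru": "Russian", "es": "Spanish",
    "sw": "Swahili", "sv": "Swedish", "tl": "Tagalog", "th": "Thai",
    "tr": "Turkish", "vi": "Vietnamese",
}


def pin_language(code: Optional[str]) -> dict:
    """Return the fragments that pin a language, or {} to keep auto mode.

    Use either fragment: "setup" goes into the first frame,
    "session" into session.update.
    """
    if not code:
        return {}  # unset = global/auto, detected per turn
    code = code.lower()
    if code not in SUPPORTED:
        raise ValueError(f"unsupported language code: {code!r}")
    return {
        "setup": {"locale": code},
        "session": {"input_audio_transcription": {"language": code}},
    }
```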

5. Output format (emotion · prosody tags)

The output format is tuned for voice conversation. A client that understands it can render richer expression.

  • Prosody tags: the answer text contains lowercase English bracket tags like [laughing], [sigh], [breath], [pause], [hesitation] mixed in where emotion shifts (for speech prosody). The model is instructed not to use Korean tags like [웃음], parenthesized stage directions like (smiling), or asterisks like *smiles*. Known vocabulary: laughing/laugh/laughter/chuckle/giggle · sigh/exhale · breath/inhale · pause · hesitation · gasp/cough/sneeze/yawn/sniff/hum · cry/sob/moan/whisper/shout/cheer (other tags are passed through).
  • For each tag, the server sends a 1:1 emotion.updated event (state == tag name, lowercase):
    { "type": "emotion.updated", "state": "laughing", "tag": "[laughing]", "known": true }
    
  • The TTS path keeps the tags and feeds them into synthesis for speech prosody, while chat text.delta sends clean text with tags stripped. (No emojis, markdown, or parenthetical self-narration in the output.)
  • Client mapping (naia-os reference): map emotion.updated.state (prosody tag) to avatar expressions — laughing/chuckle/giggle/cheer → happy, sigh/exhale/cry/sob → sad, gasp → surprised, shout → angry, hesitation → think. Non-emotional prosody like breath·pause does not change the expression (keep the previous one — so it doesn't blink to neutral on every breath).
  • Robust handling recommended: LLM output isn't always exact. Prefer emotion.updated, but if it's missing, auto-detect tags in the transcript itself (uppercase [HAPPY] / lowercase prosody tags) or leaked stage directions ((smiles)·*sigh*) and reflect them in the expression; if there's no cue, keep the current expression (cf. naia-os shell/src/lib/vrm/expression.ts extractExpression).
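
The tag handling described above could look roughly like this on the client side. This is a simplified sketch, not the naia-os implementation (see expression.ts for that). The mapping copies the reference mapping listed above. Treating an uppercase tag such as [HAPPY] as the expression of the same name is an assumption, and leaked stage directions like (smiles) are not handled here.

```python
import re
from typing import List, Optional

TAG_RE = re.compile(r"\[([A-Za-z]+)\]")

# Prosody tag -> avatar expression, as listed in the client mapping above.
EXPRESSION = {
    "laughing": "happy", "chuckle": "happy", "giggle": "happy", "cheer": "happy",
    "sigh": "sad", "exhale": "sad", "cry": "sad", "sob": "sad",
    "gasp": "surprised", "shout": "angry", "hesitation": "think",
}
EXPRESSION_NAMES = set(EXPRESSION.values())


def extract_tags(text: str) -> List[str]:
    """Bracket tags in order of appearance, lowercased."""
    return [tag.lower() for tag in TAG_RE.findall(text)]


def strip_tags(text: str) -> str:
    """Text with bracket tags removed and leftover spacing collapsed."""
    return re.sub(r"\s{2,}", " ", TAG_RE.sub("", text)).strip()


def resolve_expression(state: Optional[str], transcript: str, previous: str) -> str:
    """Prefer emotion.updated.state; otherwise fall back to tags in the transcript.

    Tags with no mapping (breath, pause, ...) keep the previous expression.
    """
    candidates = [state.lower()] if state else extract_tags(transcript)
    current = previous
    for tag in candidates:
        if tag in EXPRESSION:
            current = EXPRESSION[tag]
        elif tag in EXPRESSION_NAMES:  # assumption: [HAPPY] means "happy"
            current = tag
    return current
```

Returning the previous expression by default is what prevents the avatar from snapping back to neutral on every breath or pause.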

6. See also

  • Reference implementation / sample code (open source): naia-os's voice client shell/src/lib/voice/ (Apache 2.0) — contains the actual client that talks to this API (naia-omni.ts) and the emotion/prosody handling (emotion-tags.ts; expression mapping & robust extraction in vrm/expression.ts). Use it as a starting point for testing new models and building Tauri apps. Try it live at 4.3 live demo.
  • Lineup & pricing: 4.1 Model Pricing
  • Cloud (planned): 4.6 Online