Developer guide for using the Naia model from code. After starting the model as described in 4.4 Naia Model Download, use the locally served OpenAI-compatible API directly (no gateway, no queue). With any OpenAI SDK or tool, you only need to point the base URL at this model.
This is not limited to naia-os/shell: any code that speaks OpenAI Realtime/Chat/Audio/Embeddings connects as-is, and you can build and run new applications on top of this model.
1. Connect · auth
- REST base: `http://<host>:8892/v1` (`127.0.0.1` on the same PC)
- Realtime (WS): `ws://<host>:8892/v1/realtime` (a bare `ws://<host>:8892` also works; the `/v1/realtime` path and the default model are applied automatically)
- Connect: on local (`127.0.0.1`) / Tailscale, no auth is required; the container self-verifies its license. Clients that need a key field (OpenAI SDK etc.) can pass any value (e.g. `naia`). When exposing remotely, put §4.4 Tailscale/VPN in front.
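
As a minimal sketch of the two URL forms listed above, the helper below derives the REST base and the Realtime WebSocket URL from a host name. The function name `naia_urls` is hypothetical (not part of any SDK), and it assumes a plain hostname or IPv4 address on port 8892.

```python
PORT = 8892  # port used throughout this guide


def naia_urls(host: str = "127.0.0.1", port: int = PORT) -> dict:
    """Return the REST base and Realtime WebSocket URLs for one host.

    Hypothetical helper for illustration; it only formats strings and
    does not contact the server.
    """
    return {
        "rest": f"http://{host}:{port}/v1",
        "realtime": f"ws://{host}:{port}/v1/realtime",
    }


if __name__ == "__main__":
    for name, url in naia_urls().items():
        print(f"{name}: {url}")
```

For another device on Tailscale/VPN, pass that machine's address as `host`.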
🔑 One key — the subscription key
- Subscription key: the key you get from the portal. It is used only at container run time (activation), via `-e NAIA_ACCOUNT_TOKEN=<subscription-key>`. It checks the subscription and obtains a time-bound license (certificate).
- There is no separate connection key. Once activated, the container self-verifies locally with the certificate, so clients (naia-os, OpenAI SDK) just need to connect by URL: `127.0.0.1` on the same PC, or Tailscale/VPN (§4.4) from another device. It does not call the gateway per connection.
- The `api_key` in the examples below is a placeholder (the OpenAI SDK requires the field); the offline container doesn't check it, so any value like `"naia"` works.
2. Endpoints (OpenAI-compatible)
| Endpoint | Use | Backend |
|---|---|---|
| `GET /health` | Readiness `{"ready":true,"services":{tts,stt,llm},"vad":true}` (no auth) | - |
| `GET /v1/models` | Model list | - |
| `WS /v1/realtime` | Realtime voice session (VAD, barge-in, emotion) | cascade |
| `POST /v1/chat/completions` | Chat (streaming) | gemma4-e4b |
| `POST /v1/audio/speech` | Text-to-speech (TTS) | VoxCPM2 |
| `POST /v1/audio/transcriptions` | Speech-to-text (STT) | Whisper |
| `POST /v1/embeddings` | Embeddings | bge-m3 |
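
Before sending real requests it can help to poll the readiness endpoint from the table above. The sketch below uses only the Python standard library. It assumes `/health` is served at the root of port 8892 (the table lists it without the `/v1` prefix) and inspects only the `ready` field, since the exact shape of `services` is abbreviated in this guide. The function names are hypothetical.

```python
import json
import urllib.request


def is_ready(payload: dict) -> bool:
    """True only when the health payload reports "ready": true."""
    return payload.get("ready") is True


def fetch_health(base: str = "http://127.0.0.1:8892", timeout: float = 3.0) -> dict:
    """GET /health and decode the JSON body (no Authorization header sent)."""
    with urllib.request.urlopen(f"{base}/health", timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    try:
        print("ready" if is_ready(fetch_health()) else "still starting")
    except OSError as exc:  # connection refused, timeout, HTTP error
        print(f"server not reachable: {exc}")
```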
Chat (curl):
```sh
curl -s http://127.0.0.1:8892/v1/chat/completions \
  -H "Authorization: Bearer naia" -H "Content-Type: application/json" \
  -d '{"model":"naia-0.9-omni-24g","messages":[{"role":"user","content":"hi"}],"stream":false}'
```
OpenAI SDK (Python) — just swap baseURL:
```python
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8892/v1", api_key="naia")
print(client.chat.completions.create(
    model="naia-0.9-omni-24g",
    messages=[{"role": "user", "content": "hi"}],
).choices[0].message.content)
```
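
The endpoint table lists chat as streaming. As a rough sketch, assuming the server emits the same `data: {...}` lines that the OpenAI Chat Completions API uses for `"stream": true` (ending with `data: [DONE]`), the fragments can be collected as below. `iter_text_deltas` is a hypothetical helper name, and the sample lines are made up for illustration, not captured server output.

```python
import json
from typing import Iterable, Iterator


def iter_text_deltas(lines: Iterable[str]) -> Iterator[str]:
    """Yield content fragments from OpenAI-style `data: {...}` stream lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything else
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream marker
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text


if __name__ == "__main__":
    sample = [  # illustrative lines only
        'data: {"choices":[{"delta":{"role":"assistant"}}]}',
        "",
        'data: {"choices":[{"delta":{"content":"Hel"}}]}',
        'data: {"choices":[{"delta":{"content":"lo"}}]}',
        "data: [DONE]",
    ]
    print("".join(iter_text_deltas(sample)))
```

With the OpenAI Python SDK the equivalent is to pass `stream=True` to `client.chat.completions.create(...)` and read `chunk.choices[0].delta.content` from each chunk.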
Transcription (STT):
```sh
curl -s http://127.0.0.1:8892/v1/audio/transcriptions \
  -H "Authorization: Bearer naia" \
  -F file=@sample.wav -F model=naia-0.9-omni-24g
```
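
The two remaining POST endpoints in the table (speech and embeddings) have no example above. The sketch below only builds the requests, using the field names of the OpenAI API (`input`, `voice`). Reusing the model name `naia-0.9-omni-24g` for these endpoints is an assumption carried over from the chat and STT examples, and the voice names accepted by this server are not documented here, so `voice` is left out unless you pass one. All function names are hypothetical.

```python
import json
import math
import urllib.request
from typing import Optional

BASE = "http://127.0.0.1:8892/v1"
MODEL = "naia-0.9-omni-24g"  # assumption: same name as in the chat/STT examples


def json_post(path: str, body: dict) -> urllib.request.Request:
    """Build (not send) a JSON POST carrying the placeholder bearer key."""
    return urllib.request.Request(
        BASE + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Authorization": "Bearer naia", "Content-Type": "application/json"},
        method="POST",
    )


def speech_request(text: str, voice: Optional[str] = None) -> urllib.request.Request:
    body = {"model": MODEL, "input": text}
    if voice is not None:  # OpenAI's API has a "voice" field; valid values here are unknown
        body["voice"] = voice
    return json_post("/audio/speech", body)


def embeddings_request(texts: list) -> urllib.request.Request:
    return json_post("/embeddings", {"model": MODEL, "input": texts})


def cosine(a, b) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


if __name__ == "__main__":
    req = embeddings_request(["hello", "hi there"])
    print(req.get_method(), req.full_url)
    print(cosine([1.0, 0.0], [1.0, 0.0]))
```

Pass either request to `urllib.request.urlopen(req)` to send it. The speech response body is audio bytes. For embeddings, the OpenAI response shape puts each vector at `data[i].embedding`, which this server presumably mirrors.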
3. Realtime voice — connection flow (WS)
Same flow the 4.3 live demo uses. (Offline starts immediately, with no gateway queue/assignment.)
- Connect: open `ws://<host>:8892`.
- First frame (auth · language): browser WebSockets can't send headers, so send this as the first message:

  ```json
  { "setup": { "apiKey": "naia", "locale": "en" } }
  ```

- When the server sends `session.created`, configure the session with `session.update`:

  ```json
  {
    "type": "session.update",
    "session": {
      "modalities": ["text", "audio"],
      "input_audio_format": "pcm16",
      "output_audio_format": "pcm16",
      "instructions": "<persona instructions>",
      "turn_detection": { "type": "server_vad" },
      "input_audio_transcription": { "language": "en" },
      "ref_audio_url": "<URL of a voice sample to mimic (optional)>"
    }
  }
  ```

- Exchange:

  | Client → Server | Message |
  |---|---|
  | Voice input | `{"type":"input_audio_buffer.append","audio":"<base64 PCM16 24kHz>"}` (server VAD detects end of speech) |
  | Text input | `conversation.item.create` then `response.create` |
  | Barge-in | `response.cancel` |

  | Server → Client | Meaning |
  |---|---|
  | `response.audio.delta` | base64 PCM16 24kHz audio chunk |
  | `response.audio_transcript.delta` / `response.text.delta` | answer text (streaming) |
  | `conversation.item.input_audio_transcription.completed` | transcript of your speech |
  | `emotion.updated` | emotion / prosody tag (§5) |
  | `response.done` | end of one turn |
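
The flow above can be sketched as a few frame builders plus a small accumulator for server events. This is an illustration under stated assumptions, not the reference client: all names are hypothetical, audio is assumed to be mono, and reading the payload from a `delta` field follows the OpenAI Realtime event shape, which this guide does not spell out. The strings produced here can be sent with any WebSocket client.

```python
import base64
import json
from typing import Iterator, Optional

SAMPLE_RATE = 24_000   # Hz, per the exchange table
BYTES_PER_SAMPLE = 2   # PCM16 = 16-bit samples (mono assumed)


def setup_frame(api_key: str = "naia", locale: Optional[str] = None) -> str:
    """First message after connecting; omit locale for auto-detect."""
    setup = {"apiKey": api_key}
    if locale:
        setup["locale"] = locale
    return json.dumps({"setup": setup})


def session_update(instructions: str, language: Optional[str] = None) -> str:
    """Sent once the server has emitted session.created."""
    session = {
        "modalities": ["text", "audio"],
        "input_audio_format": "pcm16",
        "output_audio_format": "pcm16",
        "instructions": instructions,
        "turn_detection": {"type": "server_vad"},
    }
    if language:
        session["input_audio_transcription"] = {"language": language}
    return json.dumps({"type": "session.update", "session": session})


def audio_appends(pcm: bytes, chunk_ms: int = 100) -> Iterator[str]:
    """Yield input_audio_buffer.append frames of about chunk_ms each."""
    # 24000 samples/s * 2 bytes * 0.1 s = 4800 bytes per 100 ms chunk
    step = SAMPLE_RATE * BYTES_PER_SAMPLE * chunk_ms // 1000
    for i in range(0, len(pcm), step):
        audio = base64.b64encode(pcm[i:i + step]).decode("ascii")
        yield json.dumps({"type": "input_audio_buffer.append", "audio": audio})


class Turn:
    """Accumulates one reply from server events until response.done."""

    def __init__(self) -> None:
        self.audio = bytearray()  # decoded PCM16 output
        self.text = []            # streamed answer text fragments
        self.done = False

    def feed(self, message: str) -> None:
        event = json.loads(message)
        kind = event.get("type")
        if kind == "response.audio.delta":
            # "delta" field name assumed from the OpenAI Realtime event shape
            self.audio += base64.b64decode(event.get("delta", ""))
        elif kind in ("response.audio_transcript.delta", "response.text.delta"):
            self.text.append(event.get("delta", ""))
        elif kind == "response.done":
            self.done = True
```

A session would send `setup_frame()`, wait for `session.created`, send `session_update(...)`, stream `audio_appends(...)`, and call `Turn.feed` on each incoming message. Barge-in is a single `{"type": "response.cancel"}` frame.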
4. Languages — 30 languages (default = auto/global)
The model supports 30 languages (Arabic, Burmese, Chinese, Danish, Dutch, English, Finnish, French, German, Greek, Hebrew, Hindi, Indonesian, Italian, Japanese, Khmer, Korean, Lao, Malay, Norwegian, Polish, Portuguese, Russian, Spanish, Swahili, Swedish, Tagalog, Thai, Turkish, Vietnamese).
- Default (unset) = global/auto — it detects the language you spoke and replies in that language (per turn).
- To pin a specific language, give an ISO 639-1 code (e.g. `ko` / `en` / `ja`) in `setup.locale` or in `input_audio_transcription.language` of `session.update`.
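
To make the pinning rule concrete, here is a small sketch that maps the 30 listed languages to their ISO 639-1 codes and produces the settings fragment for either place named above. The codes are the standard ISO 639-1 ones. That the server accepts exactly these spellings is an assumption, and `pin_language` is a hypothetical helper.

```python
from typing import Optional

# ISO 639-1 codes for the 30 languages listed above.
LANGUAGES = {
    "ar": "Arabic", "my": "Burmese", "zh": "Chinese", "da": "Danish",
    "nl": "Dutch", "en": "English", "fi": "Finnish", "fr": "French",
    "de": "German", "el": "Greek", "he": "Hebrew", "hi": "Hindi",
    "id": "Indonesian", "it": "Italian", "ja": "Japanese", "km": "Khmer",
    "ko": "Korean", "lo": "Lao", "ms": "Malay", "no": "Norwegian",
    "pl": "Polish", "pt": "Portuguese", "ru": "Russian", "es": "Spanish",
    "sw": "Swahili", "sv": "Swedish", "tl": "Tagalog", "th": "Thai",
    "tr": "Turkish", "vi": "Vietnamese",
}


def pin_language(code: Optional[str]) -> dict:
    """Return fragments for setup / session.update; empty means auto-detect."""
    if code is None:
        return {"setup": {}, "session": {}}  # unset = global/auto
    if code not in LANGUAGES:
        raise ValueError(f"unsupported language code: {code!r}")
    return {
        "setup": {"locale": code},
        "session": {"input_audio_transcription": {"language": code}},
    }


if __name__ == "__main__":
    print(pin_language("ko"))
    print(pin_language(None))
```

Use one of the two fragments, not necessarily both: the guide says either location pins the language.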
5. Output format (emotion · prosody tags)
The output format is tuned for voice conversation; a client that understands it can render richer expression.
- Prosody tags: the answer text contains lowercase English bracket tags like `[laughing]`, `[sigh]`, `[breath]`, `[pause]`, `[hesitation]` mixed in where emotion shifts (for speech prosody). The model is instructed not to use Korean tags like `[웃음]`, parenthesized stage directions like `(smiling)`, or asterisks like `*smiles*`. Known vocabulary: `laughing/laugh/laughter/chuckle/giggle · sigh/exhale · breath/inhale · pause · hesitation · gasp/cough/sneeze/yawn/sniff/hum · cry/sob/moan/whisper/shout/cheer` (other tags are passed through).
- For each tag, the server sends a 1:1 `emotion.updated` event (`state` == tag name, lowercase):

  ```json
  { "type": "emotion.updated", "state": "laughing", "tag": "[laughing]", "known": true }
  ```

- The TTS path keeps the tags and feeds them into synthesis for speech prosody, while chat `text.delta` sends clean text with tags stripped. (No emojis, markdown, or parenthetical self-narration in the output.)
- Client mapping (naia-os reference): map `emotion.updated.state` (prosody tag) to avatar expressions: `laughing/chuckle/giggle/cheer → happy`, `sigh/exhale/cry/sob → sad`, `gasp → surprised`, `shout → angry`, `hesitation → think`. Non-emotional prosody like `breath` · `pause` does not change the expression (keep the previous one, so it doesn't blink to neutral on every breath).
- Robust handling recommended: LLM output isn't always exact. Prefer `emotion.updated`, but if it's missing, auto-detect tags in the transcript itself (uppercase `[HAPPY]` / lowercase prosody tags) or leaked stage directions (`(smiles)` · `*sigh*`) and reflect them in the expression; if there's no cue, keep the current expression (cf. naia-os `shell/src/lib/vrm/expression.ts` `extractExpression`).
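
The mapping and fallback rules above can be sketched as follows. This is an independent illustration, not the naia-os `extractExpression` implementation. The function names are hypothetical, and the `smiles` / `smiling` aliases are an assumption added so that leaked stage directions like `(smiles)` resolve to something.

```python
import re

# Tag -> expression, taken from the client-mapping bullet above.
TAG_TO_EXPRESSION = {
    **dict.fromkeys(["laughing", "chuckle", "giggle", "cheer"], "happy"),
    **dict.fromkeys(["sigh", "exhale", "cry", "sob"], "sad"),
    "gasp": "surprised",
    "shout": "angry",
    "hesitation": "think",
    # Assumed aliases for leaked stage directions (not from the guide).
    "smiles": "happy",
    "smiling": "happy",
}
EXPRESSIONS = set(TAG_TO_EXPRESSION.values())

# [tag], (stage direction), or *stage direction*
CUE = re.compile(r"\[([A-Za-z]+)\]|\(([A-Za-z]+)\)|\*([A-Za-z]+)\*")


def resolve(word: str, current: str) -> str:
    """Map one cue to an expression; unknown cues keep the current one."""
    key = word.lower()
    if key in EXPRESSIONS:  # e.g. an uppercase [HAPPY] tag
        return key
    return TAG_TO_EXPRESSION.get(key, current)  # breath/pause fall through


def on_emotion_event(event: dict, current: str) -> str:
    """Preferred path: react to an emotion.updated event."""
    if event.get("type") != "emotion.updated":
        return current
    return resolve(event.get("state", ""), current)


def scan_transcript(text: str, current: str) -> str:
    """Fallback path: the last recognized cue in the text wins."""
    for match in CUE.finditer(text):
        word = next(g for g in match.groups() if g)
        current = resolve(word, current)
    return current


if __name__ == "__main__":
    face = "neutral"
    face = on_emotion_event({"type": "emotion.updated", "state": "laughing"}, face)
    print(face)
    print(scan_transcript("[breath] Well... *sigh* okay.", face))
```

Because unknown and non-emotional cues return `current` unchanged, a `[breath]` or `[pause]` never resets the avatar, which matches the "keep the previous one" rule.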
6. See also
- Reference implementation / sample code (open source): naia-os's voice client `shell/src/lib/voice/` (Apache 2.0), which contains the actual client that talks to this API (`naia-omni.ts`) and the emotion/prosody handling (`emotion-tags.ts`; expression mapping & robust extraction in `vrm/expression.ts`). Use it as a starting point for testing new models and building Tauri apps. Try it live at 4.3 live demo.
- Lineup & pricing: 4.1 Model Pricing
- Cloud (planned): 4.6 Online