Naia Model — Developer Guide

Developer guide for using the Naia model from code. Once the model is running (see 4.4 Naia Model Download), use the OpenAI-compatible API it serves locally as-is, with no gateway and no queue. With any OpenAI SDK or tool, you only need to point baseURL at this model.

This is not limited to the naia-os shell: any code that speaks OpenAI Realtime/Chat/Audio/Embeddings connects as-is, and you can build and run new applications on top of this model.

1. Connect · auth

REST base: http://<host>:8892/v1 (127.0.0.1 on the same PC)
Realtime (WS): ws://<host>:8892/v1/realtime (a bare ws://<host>:8892 also works — path /v1/realtime + default model auto-applied)
Connect: on local (127.0.0.1) or Tailscale connections, no auth is required, because the container self-verifies its license. Clients that require a key field (OpenAI SDK etc.) can pass any value, such as naia. When exposing the model remotely, put Tailscale/VPN (§4.4) in front.

🔑 One key — the subscription key

Subscription key: the key you get from the portal. It is used only when the container is started (activation), passed as -e NAIA_ACCOUNT_TOKEN=<subscription-key>. Activation checks the subscription and obtains a time-bound license (certificate).
There is no separate connection key. Once activated, the container self-verifies locally with the certificate, so clients (naia-os, OpenAI SDK) just need to connect by URL — 127.0.0.1 on the same PC, or Tailscale/VPN (§4.4) from another device. It does not call the gateway per connection.
The api_key in the examples below is a placeholder (OpenAI SDK requires the field) — the offline container doesn't check it, so any value like "naia" works.

2. Endpoints (OpenAI-compatible)

Endpoint	Use
`GET /health`	Readiness `{"ready":true,"services":{tts,stt,llm},"vad":true}` (no auth)
`GET /v1/models`	Model list
`WS /v1/realtime`	Realtime voice session (VAD, barge-in, emotion)
`POST /v1/chat/completions`	Chat (streaming)
`POST /v1/audio/speech`	Text-to-speech (TTS)
`POST /v1/audio/transcriptions`	Speech-to-text (STT)
`POST /v1/embeddings`	Embeddings
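
Before calling the other endpoints, a client may want to wait until GET /health reports readiness. The following is a minimal sketch using only the Python standard library. The helper names (fetch_health, is_ready, wait_until_ready), the retry count, and the delay are illustrative assumptions, and only the top-level ready field from the table above is inspected.

```python
import json
import time
import urllib.request

BASE = "http://127.0.0.1:8892"  # same-PC address from section 1


def fetch_health(base: str = BASE, timeout: float = 3.0) -> dict:
    """GET /health and decode the JSON body (the table lists it as no-auth)."""
    with urllib.request.urlopen(f"{base}/health", timeout=timeout) as resp:
        return json.load(resp)


def is_ready(health: dict) -> bool:
    """True only when the payload carries "ready": true."""
    return health.get("ready") is True


def wait_until_ready(fetch=fetch_health, attempts: int = 30, delay: float = 1.0) -> bool:
    """Poll until ready or until the attempts run out (values are assumptions)."""
    for _ in range(attempts):
        try:
            if is_ready(fetch()):
                return True
        except OSError:
            pass  # connection refused or timed out: the server is still starting
        time.sleep(delay)
    return False
```

Calling wait_until_ready() with no arguments polls the local server. Passing a different fetch callable makes the loop usable against another host or in a unit test.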

Chat (curl):

curl -s http://127.0.0.1:8892/v1/chat/completions \
  -H "Authorization: Bearer naia" -H "Content-Type: application/json" \
  -d '{"model":"naia-0.9-omni-24g","messages":[{"role":"user","content":"hi"}],"stream":false}'

OpenAI SDK (Python) — just swap baseURL:

from openai import OpenAI
client = OpenAI(base_url="http://127.0.0.1:8892/v1", api_key="naia")
print(client.chat.completions.create(
    model="naia-0.9-omni-24g",
    messages=[{"role": "user", "content": "hi"}],
).choices[0].message.content)

Transcription (STT):

curl -s http://127.0.0.1:8892/v1/audio/transcriptions \
  -H "Authorization: Bearer naia" \
  -F file=@sample.wav -F model=naia-0.9-omni-24g
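
Text-to-speech (TTS) has no example above, so here is a hedged sketch of a POST /v1/audio/speech call with the standard library. The body fields model and input follow the OpenAI Audio API. Whether this server also needs a voice field, and which audio format it returns, are not stated in this guide, so voice is optional here and the response bytes are written out unchanged. The function names are hypothetical.

```python
import json
import urllib.request
from typing import Optional

BASE = "http://127.0.0.1:8892/v1"
MODEL = "naia-0.9-omni-24g"


def build_speech_request(text: str, voice: Optional[str] = None) -> urllib.request.Request:
    """Build (but do not send) the TTS request."""
    body = {"model": MODEL, "input": text}
    if voice is not None:
        body["voice"] = voice  # assumption: OpenAI-style field, may be ignored
    return urllib.request.Request(
        f"{BASE}/audio/speech",
        data=json.dumps(body).encode("utf-8"),
        # The key is a placeholder; the offline container does not check it.
        headers={"Authorization": "Bearer naia", "Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text: str, out_path: str) -> None:
    """Send the request and save the returned audio bytes as-is."""
    with urllib.request.urlopen(build_speech_request(text)) as resp:
        audio = resp.read()
    with open(out_path, "wb") as f:
        f.write(audio)
```

Keeping request construction separate from sending makes the payload easy to inspect before any audio is generated.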

3. Realtime voice — connection flow (WS)

This is the same flow the 4.3 live demo uses. (Offline, the session starts immediately, with no gateway queue or assignment.)

Connect — open ws://<host>:8892.
First frame (auth · language) — browser WebSockets can't send headers, so send as the first message:
```
{ "setup": { "apiKey": "naia", "locale": "en" } }
```

When the server sends session.created, configure the session with session.update:

{
  "type": "session.update",
  "session": {
    "modalities": ["text", "audio"],
    "input_audio_format": "pcm16",
    "output_audio_format": "pcm16",
    "instructions": "<persona instructions>",
    "turn_detection": { "type": "server_vad" },
    "input_audio_transcription": { "language": "en" },
    "ref_audio_url": "<URL of a voice sample to mimic (optional)>"
  }
}
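
The opening sequence (setup frame, wait for session.created, then session.update) can be sketched independently of any particular WebSocket library. In the sketch below, send and recv are hypothetical callables that the caller wires to its own socket (for example a WebSocket client's send and receive methods). Skipping unrelated events while waiting is a defensive assumption, and a real client would add a timeout.

```python
import json
from typing import Callable, Optional


def setup_frame(api_key: str = "naia", locale: str = "en") -> str:
    """First frame: auth placeholder and language."""
    return json.dumps({"setup": {"apiKey": api_key, "locale": locale}})


def session_update(instructions: str, language: str = "en",
                   ref_audio_url: Optional[str] = None) -> str:
    """session.update with the fields shown above."""
    session = {
        "modalities": ["text", "audio"],
        "input_audio_format": "pcm16",
        "output_audio_format": "pcm16",
        "instructions": instructions,
        "turn_detection": {"type": "server_vad"},
        "input_audio_transcription": {"language": language},
    }
    if ref_audio_url is not None:  # optional voice sample to mimic
        session["ref_audio_url"] = ref_audio_url
    return json.dumps({"type": "session.update", "session": session})


def handshake(send: Callable[[str], None], recv: Callable[[], str],
              instructions: str) -> None:
    """Run the opening exchange over caller-supplied send/recv callables."""
    send(setup_frame())
    while True:
        event = json.loads(recv())
        if event.get("type") == "session.created":
            break  # other event types are ignored while waiting
    send(session_update(instructions))
```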

Exchange

Client → Server
Voice input	`{"type":"input_audio_buffer.append","audio":"<base64 PCM16 24kHz>"}` (server VAD detects end of speech)
Text input	`conversation.item.create` then `response.create`
Barge-in	`response.cancel`

Server → Client
`response.audio.delta`	base64 PCM16 24kHz audio chunk
`response.audio_transcript.delta` / `response.text.delta`	answer text (streaming)
`conversation.item.input_audio_transcription.completed`	transcript of your speech
`emotion.updated`	emotion / prosody tag (§5)
`response.done`	end of one turn
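
As an illustration of the exchange tables, the sketch below splits raw PCM16 24kHz audio into input_audio_buffer.append frames and routes incoming server events. Mono audio, the 100 ms chunk size, and the delta field name on server events are assumptions (the field name follows the OpenAI Realtime API); the function names are hypothetical.

```python
import base64
import json
from typing import Iterator, List

SAMPLE_RATE = 24_000   # Hz, per the tables above
BYTES_PER_SAMPLE = 2   # PCM16; mono is an assumption


def append_events(pcm: bytes, chunk_ms: int = 100) -> Iterator[str]:
    """Yield one input_audio_buffer.append frame per chunk of audio."""
    step = SAMPLE_RATE * BYTES_PER_SAMPLE * chunk_ms // 1000
    for start in range(0, len(pcm), step):
        chunk = pcm[start:start + step]
        yield json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode("ascii"),
        })


def handle_event(raw: str, audio_out: bytearray, text_out: List[str]) -> bool:
    """Collect audio and text deltas; return True once the turn has ended."""
    event = json.loads(raw)
    kind = event.get("type")
    if kind == "response.audio.delta":
        audio_out.extend(base64.b64decode(event["delta"]))
    elif kind in ("response.audio_transcript.delta", "response.text.delta"):
        text_out.append(event["delta"])
    return kind == "response.done"
```

For barge-in, the client would send {"type": "response.cancel"} and discard any audio it has buffered but not yet played.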

4. Languages — 30 languages (default = auto/global)

The model supports 30 languages (Arabic, Burmese, Chinese, Danish, Dutch, English, Finnish, French, German, Greek, Hebrew, Hindi, Indonesian, Italian, Japanese, Khmer, Korean, Lao, Malay, Norwegian, Polish, Portuguese, Russian, Spanish, Swahili, Swedish, Tagalog, Thai, Turkish, Vietnamese).

Default (unset) = global/auto — it detects the language you spoke and replies in that language (per turn).
To pin a specific language, give an ISO-639-1 code (e.g. ko/en/ja) in setup.locale or in input_audio_transcription.language of session.update.
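
A small helper can make the auto-versus-pinned choice explicit. This sketch returns the fields to merge into the setup frame and the session.update payload; it sets both places when a code is given, which is a choice made here for illustration (the text above says either one is enough). The function name and the simple two-letter check are assumptions.

```python
import re
from typing import Optional


def language_fields(code: Optional[str]) -> dict:
    """Fields to merge into setup / session; empty dicts mean auto-detect."""
    if code is None:
        return {"setup": {}, "session": {}}
    if not re.fullmatch(r"[a-z]{2}", code):
        raise ValueError(f"expected an ISO-639-1 code such as 'ko', got {code!r}")
    return {
        "setup": {"locale": code},
        "session": {"input_audio_transcription": {"language": code}},
    }
```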

5. Output format (emotion · prosody tags)

The output format is tuned for voice conversation. A client that understands it can render richer expression.

Prosody tags: the answer text contains lowercase English bracket tags like [laughing], [sigh], [breath], [pause], [hesitation] mixed in where emotion shifts (for speech prosody). The model is instructed not to use Korean tags like [웃음], parenthesized stage directions like (smiling), or asterisks like *smiles*. Known vocabulary: laughing/laugh/laughter/chuckle/giggle · sigh/exhale · breath/inhale · pause · hesitation · gasp/cough/sneeze/yawn/sniff/hum · cry/sob/moan/whisper/shout/cheer (other tags are passed through).
For each tag, the server sends a 1:1 emotion.updated event (state == tag name, lowercase):
```
{ "type": "emotion.updated", "state": "laughing", "tag": "[laughing]", "known": true }
```
The TTS path keeps the tags and feeds them into synthesis for speech prosody, while chat text.delta sends clean text with tags stripped. (No emojis, markdown, or parenthetical self-narration in the output.)
Client mapping (naia-os reference): map emotion.updated.state (prosody tag) to avatar expressions — laughing/chuckle/giggle/cheer → happy, sigh/exhale/cry/sob → sad, gasp → surprised, shout → angry, hesitation → think. Non-emotional prosody like breath·pause does not change the expression (keep the previous one — so it doesn't blink to neutral on every breath).
Robust handling recommended: LLM output isn't always exact. Prefer emotion.updated, but if it's missing, auto-detect tags in the transcript itself (uppercase [HAPPY] / lowercase prosody tags) or leaked stage directions ((smiles)·*sigh*) and reflect them in the expression; if there's no cue, keep the current expression (cf. naia-os shell/src/lib/vrm/expression.ts extractExpression).
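
The mapping and fallback rules above can be sketched in Python as follows. This is not the naia-os implementation (that lives in vrm/expression.ts); it is a simplified illustration. The LEAKED_ALIASES entries for smiles/smiling and the rule that the last recognized cue in a transcript wins are assumptions made here.

```python
import re
from typing import Optional

EXPRESSIONS = {"happy", "sad", "surprised", "angry", "think"}

# Prosody tag -> avatar expression, as listed in the client mapping above.
EXPRESSION_BY_STATE = {
    "laughing": "happy", "chuckle": "happy", "giggle": "happy", "cheer": "happy",
    "sigh": "sad", "exhale": "sad", "cry": "sad", "sob": "sad",
    "gasp": "surprised",
    "shout": "angry",
    "hesitation": "think",
}

# Assumption: a few leaked stage-direction words mapped by hand.
LEAKED_ALIASES = {"smiles": "happy", "smiling": "happy"}

# Matches [tag], (direction) or *direction* with a single word inside.
CUE = re.compile(r"\[([A-Za-z]+)\]|\(([A-Za-z]+)\)|\*([A-Za-z]+)\*")


def _resolve(word: str) -> Optional[str]:
    word = word.lower()
    if word in EXPRESSIONS:  # e.g. uppercase [HAPPY]
        return word
    return EXPRESSION_BY_STATE.get(word) or LEAKED_ALIASES.get(word)


def next_expression(current: str, state: Optional[str] = None,
                    transcript: str = "") -> str:
    """Prefer emotion.updated.state; otherwise scan the transcript for cues."""
    if state is not None:
        # Non-emotional prosody (breath, pause) resolves to None: keep current.
        return _resolve(state) or current
    found = [_resolve(m.group(1) or m.group(2) or m.group(3))
             for m in CUE.finditer(transcript)]
    found = [expr for expr in found if expr]
    return found[-1] if found else current
```

Returning the current expression when nothing maps is what prevents the avatar from snapping back to neutral on every breath or pause.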

6. Swap the chat model · ship a new version (operations)

This is the detailed guide for swapping the chat model and updating the container directly from the command line. Individual subscribers can use it as-is (no key needed), and it includes a lock option for shared/kiosk operation. For the easy summary, see 4.4 Offline.

6.1 Swap the chat model (from 0.91)

Leave the container as is and swap only the model that handles the conversation at runtime. The voice (speaking/listening), the watermark, and the subscription verification all stay the same.

Three things to know first:

The default chat model is a built-in open-weight LLM. You can swap it and revert to the default at any time.
The model you bring in must be in GGUF format. Because the voice features use about 10GB of memory, the chat model can be at most roughly 14GB. Larger models are rejected, and if a load fails, the container automatically falls back to the model you were using, so the conversation isn't interrupted.
Individual subscribers don't need a separate key. The subscription verification (license) on your own machine is itself your authorization, so you can swap with the commands below, just as voice needs no key. (Only on a shared/kiosk box used by multiple people can the operator apply a lock at run time with -e NAIA_ADMIN_KEY=your_chosen_password. In that case you also send -H "Authorization: Bearer your_chosen_password" with each request.)

Hands-on — just set the address:

BASE=http://127.0.0.1:8892     # from another device, use the https address from §4.4 (e.g. ...:8443)

① See which model is running now and how much memory is left:

curl -s $BASE/admin/llm/status

② Swap the model — just replace the model name inside the double quotes and paste. You can put a HuggingFace model card URL (https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-GGUF) or its id (Qwen/Qwen2.5-7B-Instruct-GGUF) as-is:

curl -s -X POST $BASE/admin/llm/swap \
  -H "Content-Type: application/json" \
  -d '{"model":"Qwen/Qwen2.5-7B-Instruct-GGUF","pull":true}'

The hf.co/ prefix and the quant are added automatically (the default is Q4_K_M). If you want a specific quant, append it like Qwen/Qwen2.5-7B-Instruct-GGUF:Q5_K_M. The first time a model is fetched it takes from tens of seconds to a few minutes.

②-offline — swap with a GGUF file you already have, with no internet. For exhibitions/consultations where there's no internet, register an already-owned GGUF file instead of fetching from HuggingFace. (Rule of thumb: a name with a slash like org/repo means HuggingFace online, while a plain name with no slash means a local model.)

Copy and paste one line at a time. Put the name you want in place of mymodel and the real filename in place of mymodel.gguf:

podman cp ./mymodel.gguf naia-omni:/app/models/mymodel.gguf

podman exec naia-omni sh -lc 'printf "FROM /app/models/mymodel.gguf\n" > /tmp/Modelfile && ollama create mymodel -f /tmp/Modelfile'

curl -s -X POST $BASE/admin/llm/swap -H "Content-Type: application/json" -d '{"model":"mymodel:latest","pull":false}'

⚠️ A GGUF you converted/merged yourself may be missing the chat template, so responses can ramble or get cut off. In that case, add the model family's chat template (TEMPLATE) and stop tokens (PARAMETER stop) to the Modelfile in the second command above when registering. For developer details, see the reference implementation (§7). (Official HuggingFace Instruct GGUFs usually have these built in and work as-is.)
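
The naming rules in step ② (URL or id, optional quant, slash means HuggingFace, no slash means local) can be illustrated with a short client-side sketch. This is not the server's code: the function name is hypothetical, and the normalized form hf.co/<repo>:<quant> and the :latest default for local names are assumptions based on the examples above.

```python
from typing import Tuple

URL_PREFIXES = ("https://huggingface.co/", "hf.co/")
DEFAULT_QUANT = "Q4_K_M"  # default quant stated in step 2


def describe_model_ref(ref: str) -> Tuple[str, str, bool]:
    """Return (source, normalized name, pull flag) for a model reference."""
    ref = ref.strip()
    for prefix in URL_PREFIXES:
        if ref.startswith(prefix):
            ref = ref[len(prefix):].strip("/")
            break
    if "/" in ref:
        # org/repo[:quant] is fetched from HuggingFace.
        repo, _, quant = ref.partition(":")
        return ("huggingface", f"hf.co/{repo}:{quant or DEFAULT_QUANT}", True)
    # A plain name refers to a model already registered in the container.
    name = ref if ":" in ref else f"{ref}:latest"
    return ("local", name, False)
```

A helper like this can be handy for predicting which name to expect from /admin/llm/status after a swap, though the exact status format is not documented here.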

③ Revert to the default model:

curl -s -X POST $BASE/admin/llm/restore

On a shared/kiosk box (where the operator set NAIA_ADMIN_KEY), add -H "Authorization: Bearer your_chosen_password" to each command above. Individual subscribers don't need it.
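
For scripts that drive these admin endpoints, the optional Authorization header can be handled in one place. The sketch below only builds the requests with the standard library; the helper names are hypothetical, and the response bodies are not documented in this guide, so none are parsed.

```python
import json
import urllib.request
from typing import Optional

BASE = "http://127.0.0.1:8892"


def _admin_request(method: str, path: str, body: Optional[dict] = None,
                   admin_key: Optional[str] = None) -> urllib.request.Request:
    headers = {}
    data = None
    if body is not None:
        headers["Content-Type"] = "application/json"
        data = json.dumps(body).encode("utf-8")
    if admin_key:  # only needed where the operator set NAIA_ADMIN_KEY
        headers["Authorization"] = f"Bearer {admin_key}"
    return urllib.request.Request(f"{BASE}{path}", data=data,
                                  headers=headers, method=method)


def status_request(admin_key: Optional[str] = None) -> urllib.request.Request:
    return _admin_request("GET", "/admin/llm/status", admin_key=admin_key)


def swap_request(model: str, pull: bool = True,
                 admin_key: Optional[str] = None) -> urllib.request.Request:
    return _admin_request("POST", "/admin/llm/swap",
                          {"model": model, "pull": pull}, admin_key)


def restore_request(admin_key: Optional[str] = None) -> urllib.request.Request:
    return _admin_request("POST", "/admin/llm/restore", admin_key=admin_key)
```

Each request can then be sent with urllib.request.urlopen(req). On a personal machine the admin_key argument is simply left out.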

After swapping, apps like naia-os keep connecting to the same address (no need to reconnect). If you want it to keep starting with that model after a restart or update, set the default model when launching the container with -e NAIA_LLM_MODEL=Qwen/Qwen2.5-7B-Instruct-GGUF.

6.2 Update to a new version

When a new version comes out, change only the image (version) and leave the subscription/settings as is. The first time you turn on the new version, the container re-authenticates automatically over the internet (same existing subscription/device — no need to re-enter the key by hand). So you must be connected to the internet when updating.

podman pull ghcr.io/nextain/naia-0.9-omni-24g:latest      # fetch the latest version
podman stop naia-omni && podman rm naia-omni      # clean up only the container (see caution below)
# Re-run the exact run command you used at first install — just mount the same license volume and you're done.

⚠️ Do not press "release device" when updating. Release is only for moving to a different computer. If you release while trying to update, you'll have to authenticate from scratch. Updates keep the subscription and device registration as long as you keep the license volume intact.

If you've already authenticated, just fetch the latest version and turn it on again as above, and you'll move straight to the new version that can swap models (authentication preserved). To fetch a specific version, use a version number like :0.91 instead of :latest.

7. See also

Reference implementation / sample code (open source): naia-os's voice client shell/src/lib/voice/ (Apache 2.0) — contains the actual client that talks to this API (naia-omni.ts) and the emotion/prosody handling (emotion-tags.ts; expression mapping & robust extraction in vrm/expression.ts). Use it as a starting point for testing new models and building Tauri apps. Try it live at 4.3 live demo.
Lineup & pricing: 4.1 Model Pricing
Cloud (planned): 4.6 Online
