Naia
Table of Contents
  1 Video Manual
  2 Naia OS Live USB
  3 Installation
    3.1 Naia OS Installation (ISO)
    3.2 Linux App Installation
  4 Getting Started
    4.1 Naia Model Pricing
    4.2 naia-0.9-omni-24g Realtime Multimodal Module
    4.3 Live Demo
    4.4 Naia Model Download
    4.5 Naia Model — Developer Guide
    4.6 Naia Model Online (planned)
  5 Main Screen
  6 Chat
  7 Conversation History
  8 Work Progress
  9 Skills
  10 Channels
  11 Agents
  12 Diagnostics
  13 Workspace
  14 Browser
  15 Panel Management
  16 Voice Chat
  17 Settings
  18 Tool Details
  19 Naia Account
  20 Troubleshooting
  21 Open Source Usage & Contribution

4.2. naia-0.9-omni-24g Realtime Multimodal Module

naia-0.9-omni-24g uses the same interface as an omni model, but it isn't, strictly, an omni model. It's a cascade module that naia builds by weaving several models together, aiming to be a realtime multimodal "brain." Today it starts by listening to your voice in real time and replying in voice; as versions advance, it expands toward seeing, remembering, and thinking more richly.

What it aims to be

The goal is for naia to grow beyond "listening and speaking" into a realtime cognitive module that sees, remembers, and understands context.

  • Now: realtime voice conversation — listen → think → speak
  • Ahead: planned cognitive abilities include image input (relatively soon, since current LLMs already handle images), long-term memory integration (naia-memory), and retrieval augmentation (naia-agent RAG). What gets added, and in what order, is not fixed.
  • Direction: a realtime multimodal cognitive module that brings vision, memory, and other abilities together.

The name naia-0.9-omni-24g refers not to a single voice capability but to one realtime multimodal endpoint that will carry all of these.

Looks like an omni model — but it's actually a cascade

naia-0.9-omni-24g is served in the same place, in the same way, as an omni (unified multimodal) model. To a client it's exactly like calling a single omni model. But internally it's not one model — it's a cascade.

What is a cascade

A cascade builds the whole capability by chaining proven, role-specific parts in sequence instead of doing everything with one giant model. naia-0.9-omni-24g chains speech recognition (STT) → language model (LLM) → speech synthesis (TTS), with voice activity detection (VAD) and emotion handling in between.

voice in → speech recognition (STT) → language model (LLM) → speech synthesis (TTS) → voice out

Because each stage is independent, multiple models can be used together. Choosing the best model for each stage is itself a kind of orchestration — a cascade isn't just a chain, it's an assembly that weaves several models into a better result.
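
The chain above can be sketched in Python as three swappable callables behind one call. This is only an illustration under stated assumptions: the `Cascade` class, the stage signatures, and the toy stand-in stages are hypothetical names invented here, not naia's actual implementation, and VAD and emotion handling are left out.

```python
from dataclasses import dataclass, replace
from typing import Callable

# Hypothetical stage signatures; real stages would wrap actual models.
STT = Callable[[bytes], str]   # audio in   -> transcript
LLM = Callable[[str], str]     # transcript -> reply text
TTS = Callable[[str], bytes]   # reply text -> audio out


@dataclass(frozen=True)
class Cascade:
    """Chains independent stages so the caller sees one voice-in/voice-out call."""
    stt: STT
    llm: LLM
    tts: TTS

    def respond(self, audio_in: bytes) -> bytes:
        transcript = self.stt(audio_in)
        reply = self.llm(transcript)
        return self.tts(reply)


# Toy stand-ins (assumptions for illustration only): "audio" is just UTF-8 text.
def toy_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def toy_llm(text: str) -> str:
    return f"You said: {text}"


def toy_tts(text: str) -> bytes:
    return text.encode("utf-8")


def shouting_llm(text: str) -> str:
    return f"YOU SAID: {text.upper()}"


cascade = Cascade(stt=toy_stt, llm=toy_llm, tts=toy_tts)
print(cascade.respond(b"hello"))    # b'You said: hello'

# Swapping one stage leaves the other stages and the caller's interface untouched.
upgraded = replace(cascade, llm=shouting_llm)
print(upgraded.respond(b"hello"))   # b'YOU SAID: HELLO'
```

The point of the sketch is the shape: the caller only ever sees `respond`, while any single stage can be replaced independently.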

Single omni model vs cascade

                  | Single omni model                                | Cascade (naia's way)
Composition       | one unified model                                | assembled role-specific parts
Changing ability  | fixed once trained; new ability needs retraining | swap/add parts anytime; improve without retraining
Speed             | unified, so fast (low latency)                   | extra stages may add a little latency
Multimodal growth | retrain from scratch                             | plug parts into input/output
Part choice       | locked as one                                    | pick & swap proven parts
Using models      | one fixed model                                  | multiple models together: best per stage, orchestration in itself

→ naia chose a cascade to add abilities quickly and safely. The smooth omni-like experience is delivered through one standard endpoint, while the inside grows as a flexible assembly.

Other characteristics

  • Barge-in: you can interrupt a reply mid-sentence; playback stops and the module listens again. This reproduces the natural barge-in feel of a live omni model.
  • Single 24GB GPU tier (RTX 3090 / 4090 / A5000): the -24g suffix means exactly this. The module is designed from the start to run on one GPU in a personal PC (the cloud just rents the same setup).
  • Exposed as one single endpoint — the client never needs to know if the backend is a local or cloud GPU. This single endpoint stays even as it expands to images, video, and memory.
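
As a rough illustration of the barge-in behavior listed above, the following Python sketch cancels an in-progress "playback" task as soon as a simulated user interruption fires. All names here (`speak`, `converse_with_barge_in`) are hypothetical, the timer stands in for a real voice-activity trigger, and nothing in it reflects how naia implements barge-in internally.

```python
import asyncio
import contextlib


async def speak(sentence: str, spoken: list[str]) -> None:
    """Stand-in for TTS playback: 'plays' one word per tick and records it."""
    for word in sentence.split():
        spoken.append(word)
        await asyncio.sleep(0.01)


async def converse_with_barge_in(reply: str, interrupt_after: float) -> tuple[list[str], bool]:
    """Start speaking; cancel playback if the (simulated) user cuts in first."""
    spoken: list[str] = []
    playback = asyncio.create_task(speak(reply, spoken))
    # Simulated voice-activity trigger: fires after `interrupt_after` seconds.
    user_spoke = asyncio.create_task(asyncio.sleep(interrupt_after))

    done, _ = await asyncio.wait(
        {playback, user_spoke}, return_when=asyncio.FIRST_COMPLETED
    )
    interrupted = user_spoke in done and not playback.done()

    # Cancel whichever task is no longer needed and wait for it to finish.
    leftover = playback if interrupted else user_spoke
    leftover.cancel()
    with contextlib.suppress(asyncio.CancelledError):
        await leftover
    return spoken, interrupted


long_reply = " ".join(["word"] * 20)
spoken, interrupted = asyncio.run(converse_with_barge_in(long_reply, 0.05))
print(interrupted, len(spoken))
```

In a real system the cancelled playback would be followed by a fresh listen → think → speak turn; the sketch stops at the cancellation itself.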

Why this shape grows into a "realtime multimodal brain"

  • Realtime, bidirectional: data flows both ways without breaks while the connection is open. Input is processed the moment it arrives; responses stream the moment they're generated. It's not one-question-one-answer — it exchanges anything in real time while the conversation is alive.
  • Cascade (modular) structure: input (perception), thinking (LLM), and output (expression) are separated, so adding an image/video encoder on input or a new expression on output is just plugging a part in. That's how it grows to take in anything and answer with anything, in real time.
  • Instead of retraining one monolithic model, it swaps in proven parts to add abilities quickly and safely.
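
The "processed the moment it arrives, streamed the moment it's generated" idea can be sketched with Python async generators standing in for the stages. This is a simplified illustration under assumptions: `mic`, `think`, and `speaker` are hypothetical names, the stages are trivial text transforms, and it covers only the streaming half of the picture (a real bidirectional connection would also send and receive concurrently).

```python
import asyncio
from typing import AsyncIterator

log: list[str] = []  # records event order to show input and output interleaving


async def mic(chunks: list[str]) -> AsyncIterator[str]:
    """Hypothetical input stream: yields one chunk at a time."""
    for chunk in chunks:
        log.append(f"in:{chunk}")
        yield chunk
        await asyncio.sleep(0)  # let other tasks run between chunks


async def think(inputs: AsyncIterator[str]) -> AsyncIterator[str]:
    """Stand-in for the thinking stage: transforms each chunk as it arrives."""
    async for chunk in inputs:
        yield chunk.upper()


async def speaker(outputs: AsyncIterator[str]) -> list[str]:
    """Stand-in for the output stage: emits each chunk as soon as it exists."""
    played = []
    async for chunk in outputs:
        log.append(f"out:{chunk}")
        played.append(chunk)
    return played


async def main() -> list[str]:
    return await speaker(think(mic(["a", "b", "c"])))


played = asyncio.run(main())
print(played)  # ['A', 'B', 'C']
print(log)     # ['in:a', 'out:A', 'in:b', 'out:B', 'in:c', 'out:C']
```

The log shows each output appearing before the next input is read, rather than after the whole input has ended. Under the same assumptions, a new input encoder or output expression would be one more generator inserted into the chain.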

Usage

naia-0.9-omni-24g runs standalone on a single 24GB GPU. Pricing and detailed usage are covered on dedicated pages; the main ways to use it are:

Easiest way — Live Demo

A 1-minute web demo to experience naia-0.9-omni-24g's voice quality with the free credits issued at sign-up (a quality preview, not the full service). Mic/speaker status, persona changes, reference-voice (URL) changes, and text input are supported.

👉 Open the Live Demo

Using it in naia-os

In Settings > AI, select naia-0.9-omni-24g from the model list. No API key needed — it runs on your Naia account credits.

Developers: call the API directly (the gateway Realtime API). See section 4.5, the Developer Guide.