4.2. naia-omni-cascade (Naia Manual)

naia-0.9-omni-24g uses the same interface as an omni model, but it isn't, strictly, an omni model. It's a cascade module that naia builds by weaving several models together, aiming to be a realtime multimodal "brain." Today it listens to your voice in real time and replies in voice; later versions are planned to expand toward seeing, remembering, and thinking more richly.

What it aims to be

The goal is for naia to grow beyond "listening and speaking" into a realtime cognitive module that sees, remembers, and understands context.

Now: realtime voice conversation — listen → think → speak
Ahead: planned cognitive abilities include image input (current LLMs already handle images, so this can come relatively soon), long-term memory (naia-memory) integration, and retrieval augmentation (naia-agent RAG). What gets added, and in what order, is not fixed.
Direction: a realtime multimodal cognitive module that brings vision, memory, and other abilities together.

The name naia-0.9-omni-24g refers not to a single voice capability but to one realtime multimodal endpoint that will carry all of these.

Looks like an omni model — but it's actually a cascade

naia-0.9-omni-24g is served in the same place, in the same way, as an omni (unified multimodal) model. To a client it's exactly like calling a single omni model. But internally it's not one model — it's a cascade.

What is a cascade

A cascade builds the whole capability by chaining proven, role-specific parts in sequence instead of doing everything with one giant model. naia-0.9-omni-24g chains speech recognition (STT) → language model (LLM) → speech synthesis (TTS), with voice activity detection (VAD) and emotion handling in between.

voice in → speech recognition (STT) → language model (LLM) → speech synthesis (TTS) → voice out

Because each stage is independent, multiple models can be used together. Choosing the best model for each stage is itself a kind of orchestration — a cascade isn't just a chain, it's an assembly that weaves several models into a better result.
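The chain above can be sketched in a few lines of Python. This is only an illustration of the cascade idea, not naia's actual implementation: the `Cascade` class and the `fake_stt`, `fake_llm`, `fake_tts`, and `shouting_llm` functions are hypothetical stand-ins, and VAD and emotion handling are left out for brevity.

```python
from dataclasses import dataclass
from typing import Callable


# Hypothetical stand-in stages. Each is a plain function, so any one of them
# can be replaced without touching the others. Real models would go here.
def fake_stt(audio: bytes) -> str:
    # Pretend the audio bytes are simply the spoken text in UTF-8.
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    # Pretend synthesis just encodes the reply back into bytes.
    return text.encode("utf-8")


@dataclass
class Cascade:
    """voice in -> STT -> LLM -> TTS -> voice out"""
    stt: Callable[[bytes], str]
    llm: Callable[[str], str]
    tts: Callable[[str], bytes]

    def respond(self, audio_in: bytes) -> bytes:
        transcript = self.stt(audio_in)
        reply = self.llm(transcript)
        return self.tts(reply)


pipeline = Cascade(stt=fake_stt, llm=fake_llm, tts=fake_tts)
first = pipeline.respond(b"hello")


# Swapping only the language-model stage; STT and TTS stay as they were.
def shouting_llm(text: str) -> str:
    return text.upper() + "!"


swapped = Cascade(stt=fake_stt, llm=shouting_llm, tts=fake_tts)
second = swapped.respond(b"hello")
```

The caller only ever sees `respond`, which is the sense in which a cascade can look like a single model from outside while its stages remain individually replaceable.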

Single omni model vs cascade

Aspect	Single omni model	Cascade (naia's way)
Composition	one unified model	assembled role-specific parts
Changing ability	fixed once trained; a new ability needs retraining	swap or add parts anytime; improve without retraining
Speed	unified, so fast (low latency)	extra stages may add a little latency
Multimodal growth	retrain from scratch	plug parts into input/output
Part choice	locked as one	pick and swap proven parts
Using models	one fixed model	multiple models together, the best one per stage

→ naia chose a cascade to add abilities quickly and safely. The smooth omni-like experience is delivered through one standard endpoint, while the inside grows as a flexible assembly.

Other characteristics

Barge-in: interrupt mid-sentence and start over. This faithfully reproduces the natural barge-in feel of a live omni model.
Single 24GB GPU tier (RTX 3090 / 4090 / A5000): the -24g suffix means exactly this. It is designed from the start to run on a single GPU in a personal PC (the cloud just rents the same setup).
Exposed as a single endpoint: the client never needs to know whether the backend is a local or a cloud GPU. This single endpoint stays the same even as it expands to images, video, and memory.
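One common way to get barge-in behavior is to run playback as a cancellable task and cancel it as soon as new user speech is detected. The sketch below shows that pattern with Python's asyncio. It is an assumption about how such a feature can be built, not a description of naia's internals: `speak` and `converse` are hypothetical names, and the fixed delay stands in for a VAD signal.

```python
import asyncio


async def speak(reply: str, spoken: list[str]) -> None:
    """Hypothetical TTS playback that emits one word at a time."""
    for word in reply.split():
        spoken.append(word)
        await asyncio.sleep(0.01)  # stands in for audio playback time


async def converse() -> tuple[list[str], bool]:
    spoken: list[str] = []
    playback = asyncio.create_task(
        speak("one two three four five six seven eight", spoken)
    )

    # Stand-in for VAD: the user starts talking partway through the reply.
    await asyncio.sleep(0.025)

    playback.cancel()  # barge-in: stop speaking right away
    try:
        await playback
    except asyncio.CancelledError:
        pass  # expected, since the playback task was cancelled on purpose
    return spoken, playback.cancelled()


spoken_words, was_interrupted = asyncio.run(converse())
```

After the cancellation, `spoken_words` holds only the words played before the interruption, and a new listen, think, speak turn could begin from the user's fresh input.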

Why this shape grows into a "realtime multimodal brain"

Realtime, bidirectional: data flows both ways without breaks while the connection is open. Input is processed the moment it arrives; responses stream the moment they're generated. It's not one-question-one-answer — it exchanges anything in real time while the conversation is alive.
Cascade (modular) structure: input (perception), thinking (LLM), and output (expression) are separated, so adding an image/video encoder on input or a new expression on output is just plugging a part in. That's how it grows to take in anything and answer with anything, in real time.
Instead of retraining one monolithic model, it swaps in proven parts to add abilities quickly and safely.
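The "process input the moment it arrives, stream output the moment it's generated" behavior can be illustrated with async generators. This is a minimal sketch under stated assumptions: `mic_chunks`, `respond_stream`, and `run_session` are hypothetical names, and text fragments stand in for audio frames on a real connection.

```python
import asyncio
from typing import AsyncIterator


async def mic_chunks() -> AsyncIterator[str]:
    """Hypothetical input stream that yields fragments as they arrive."""
    for chunk in ["hello", "how", "are", "you"]:
        await asyncio.sleep(0)  # yield control, as a real socket read would
        yield chunk


async def respond_stream(inputs: AsyncIterator[str]) -> AsyncIterator[str]:
    """Emit a response piece per input piece, without waiting for the end."""
    async for chunk in inputs:
        yield f"ack:{chunk}"


async def run_session() -> list[str]:
    events: list[str] = []

    async def logged_input() -> AsyncIterator[str]:
        # Record each input fragment at the moment it is consumed.
        async for chunk in mic_chunks():
            events.append(f"in:{chunk}")
            yield chunk

    async for out in respond_stream(logged_input()):
        events.append(f"out:{out}")
    return events


events = asyncio.run(run_session())
```

The recorded events interleave input and output rather than listing all inputs first, which is the difference between a streaming session and a one-question-one-answer exchange.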

Usage

naia-0.9-omni-24g is designed from the start to run standalone on a single 24GB GPU. Pricing and how to use it are covered on dedicated pages:

Free 1-minute trial → 4.3 Live Demo
Run on your own GPU (offline · $10/month subscription, individuals only) → 4.4 Offline version
Use via the cloud (online · planned) → 4.5 Online version
Full pricing & lineup → 4.1 Model pricing

Easiest way — Live Demo

The Live Demo is a 1-minute web demo that lets you experience naia-0.9-omni-24g's voice quality using the free credits issued at sign-up (a quality preview, not the full service). It supports mic/speaker status, persona changes, reference-voice (URL) changes, and text input.

👉 Open the Live Demo

Using it in naia-os

In Settings > AI, select naia-0.9-omni-24g from the model list. No API key needed — it runs on your Naia account credits.

Developers: to call the API directly (the gateway Realtime API), see 4.5 Online version.
