4.2. naia-omni-cascade

naia-0.9-omni-24g uses the same interface as an omni model, but it isn't, strictly, an omni model. It's a cascade module that naia builds by weaving several models together, aiming to be a realtime multimodal "brain." Today it starts by listening to your voice in real time and replying in voice; as versions advance, it expands toward seeing, remembering, and thinking more richly.

What it aims to be

The goal is for naia to grow beyond "listening and speaking" into a realtime cognitive module that sees, remembers, and understands context.

  • Now: realtime voice conversation — listen → think → speak
  • Ahead: planned cognitive abilities include image input (which can come relatively soon, since current LLMs already handle images), long-term memory through naia-memory integration, and retrieval augmentation through naia-agent RAG. What gets added, and in what order, is not fixed.
  • Direction: a realtime multimodal cognitive module that brings vision, memory, and other abilities together.

The name naia-0.9-omni-24g refers not to a single voice capability but to one realtime multimodal endpoint that will carry all of these.

Looks like an omni model — but it's actually a cascade

naia-0.9-omni-24g is served in the same place, in the same way, as an omni (unified multimodal) model. To a client it's exactly like calling a single omni model. But internally it's not one model — it's a cascade.

What is a cascade

A cascade builds the whole capability by chaining proven, role-specific parts in sequence instead of doing everything with one giant model. naia-0.9-omni-24g chains speech recognition (STT) → language model (LLM) → speech synthesis (TTS), with voice activity detection (VAD) and emotion handling in between.

voice in → speech recognition (STT) → language model (LLM) → speech synthesis (TTS) → voice out

Because each stage is independent, multiple models can be used together. Choosing the best model for each stage is itself a kind of orchestration — a cascade isn't just a chain, it's an assembly that weaves several models into a better result.
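
The chain above can be sketched in a few lines of Python. This is only an illustration of the cascade idea, not naia's implementation: the `Cascade` class, the stage signatures, and the toy stand-in functions are all hypothetical, and the "audio" here is just UTF-8 bytes so the sketch runs without any model.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stage signatures; real stages would wrap actual STT/LLM/TTS models.
SpeechToText = Callable[[bytes], str]
LanguageModel = Callable[[str], str]
TextToSpeech = Callable[[str], bytes]


@dataclass
class Cascade:
    """Chains three independent stages: voice in -> STT -> LLM -> TTS -> voice out."""
    stt: SpeechToText
    llm: LanguageModel
    tts: TextToSpeech

    def respond(self, audio_in: bytes) -> bytes:
        text_in = self.stt(audio_in)   # speech recognition
        text_out = self.llm(text_in)   # language model
        return self.tts(text_out)      # speech synthesis


# Toy stand-ins (assumptions for illustration only).
def toy_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def toy_llm(prompt: str) -> str:
    return f"You said: {prompt}"


def toy_tts(text: str) -> bytes:
    return text.encode("utf-8")


def shouting_tts(text: str) -> bytes:
    return text.upper().encode("utf-8")


cascade = Cascade(stt=toy_stt, llm=toy_llm, tts=toy_tts)
reply = cascade.respond(b"hello")

# Swapping one stage leaves the other two untouched.
louder = Cascade(stt=toy_stt, llm=toy_llm, tts=shouting_tts)
loud_reply = louder.respond(b"hello")
```

Because each stage is only a callable with a fixed input and output type, replacing `toy_tts` with `shouting_tts` changes the voice output without touching recognition or the language model.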

Single omni model vs cascade

| | Single omni model | Cascade (naia's way) |
|---|---|---|
| Composition | one unified model | assembled role-specific parts |
| Changing ability | fixed once trained; new ability needs retraining | swap/add parts anytime; improve without retraining |
| Speed | unified, so fast (low latency) | extra stages may add a little latency |
| Multimodal growth | retrain from scratch | plug parts into input/output |
| Part choice | locked as one | pick & swap proven parts |
| Using models | one fixed model | multiple models together: best per stage, orchestration in itself |

→ naia chose a cascade to add abilities quickly and safely. The smooth omni-like experience is delivered through one standard endpoint, while the inside grows as a flexible assembly.

Other characteristics

  • Barge-in: interrupt mid-sentence and start over. This faithfully reproduces the natural barge-in feel of a live omni model.
  • Single 24GB GPU tier (RTX 3090 / 4090 / A5000): this is what the -24g suffix means. It's designed from the start to run on a single GPU in a personal PC (the cloud just rents the same setup).
  • Exposed as one single endpoint — the client never needs to know if the backend is a local or cloud GPU. This single endpoint stays even as it expands to images, video, and memory.
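
Barge-in can be pictured as cancelling the in-progress reply the moment the user starts talking. The sketch below uses Python's `asyncio` to show that pattern under simplifying assumptions: `speak` and `converse_with_barge_in` are hypothetical names, the reply is a list of text chunks rather than audio, and the voice activity detection (VAD) signal is replaced by a fixed delay.

```python
import asyncio


async def speak(chunks: list[str], played: list[str]) -> None:
    """Plays reply chunks one at a time; stands in for streaming TTS audio."""
    for chunk in chunks:
        played.append(chunk)
        await asyncio.sleep(0.05)  # pretend each chunk takes 50 ms to play


async def converse_with_barge_in(interrupt_after: float) -> tuple[list[str], bool]:
    played: list[str] = []
    reply = ["one", "two", "three", "four", "five", "six"]
    speaking = asyncio.create_task(speak(reply, played))

    # Hypothetical VAD stand-in: the user simply starts talking after a fixed delay.
    await asyncio.sleep(interrupt_after)

    speaking.cancel()  # barge-in: stop the reply immediately
    try:
        await speaking
    except asyncio.CancelledError:
        pass  # expected: the reply was cut off mid-way
    return played, speaking.cancelled()


played, was_cancelled = asyncio.run(converse_with_barge_in(0.12))
```

After the run, `played` holds only the chunks spoken before the interruption, and the pipeline is free to listen again from the start.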

Why this shape grows into a "realtime multimodal brain"

  • Realtime, bidirectional: data flows both ways without breaks while the connection is open. Input is processed the moment it arrives; responses stream the moment they're generated. It's not one-question-one-answer — it exchanges anything in real time while the conversation is alive.
  • Cascade (modular) structure: input (perception), thinking (LLM), and output (expression) are separated, so adding an image/video encoder on input or a new expression on output is just plugging a part in. That's how it grows to take in anything and answer with anything, in real time.
  • Instead of retraining one monolithic model, it swaps in proven parts to add abilities quickly and safely.
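
The two ideas above, processing input as it arrives and plugging in new input parts, can be sketched together with `asyncio`. Everything here is an assumption made for illustration: the `encoders` registry, the `think` stand-in for the LLM stage, the `(modality, raw_bytes)` message shape, and the `None` sentinel for a closed connection are not taken from naia's actual protocol.

```python
import asyncio
from typing import AsyncIterator, Callable

# Hypothetical perception parts: each turns one kind of raw input into text for the LLM stage.
encoders: dict[str, Callable[[bytes], str]] = {
    "audio": lambda raw: raw.decode("utf-8"),  # stand-in for STT
}


def think(percept: str) -> str:
    return f"ack: {percept}"  # stand-in for the LLM stage


async def session(inbox: asyncio.Queue) -> AsyncIterator[str]:
    """Handles each input as soon as it arrives and yields a reply right away."""
    while True:
        item = await inbox.get()
        if item is None:  # sentinel: the connection was closed
            return
        modality, raw = item
        yield think(encoders[modality](raw))


async def client(inbox: asyncio.Queue) -> None:
    """Sends inputs over time while the session is already answering."""
    for item in [("audio", b"hi"), ("image", b"\x89PNG")]:
        await inbox.put(item)
        await asyncio.sleep(0.01)
    await inbox.put(None)


async def demo() -> list[str]:
    # Adding image input is just plugging one more part into the registry.
    encoders["image"] = lambda raw: f"<image of {len(raw)} bytes>"
    inbox: asyncio.Queue = asyncio.Queue()
    sender = asyncio.create_task(client(inbox))
    replies = [reply async for reply in session(inbox)]
    await sender
    return replies


replies = asyncio.run(demo())
```

The session loop never changes when a modality is added; only the registry grows, which mirrors the "plug a part in" claim.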

Usage

naia-0.9-omni-24g is designed from the start to run standalone on a single 24GB GPU. Pricing and how to use it are covered on dedicated pages:

Easiest way — Live Demo

A 1-minute web demo lets you try naia-0.9-omni-24g's voice quality using the free credits issued at sign-up (it is a quality preview, not the full service). It supports mic/speaker status, persona changes, reference-voice (URL) changes, and text input.

👉 Open the Live Demo

Using it in naia-os

In Settings > AI, select naia-0.9-omni-24g from the model list. No API key needed — it runs on your Naia account credits.

Developers can call the API directly (the gateway Realtime API); see 4.6 naia-online.