A follow-up to Gemini 3.1 Flash Live Support + MiniCPM-o 4.5 vLLM Challenge — Naia OS's S2S Real-time Voice AI Development Log. After that one, I went quiet for a while.
A lot has happened since. Nextain took on operating a Korean Christian portal (www.onmam.com) and we've been talking about its AI integration; on the Naia side, we were selected for a Korean government R&D project, which gives the service real momentum. Soon we'll be able to show Naia's real avatar — not just the default Vroid Hub one.
At the end of the previous post I wrote "next week I'm borrowing some graphics cards, so things should get easier" — and I'm finally able to keep that promise. On a two-RTX-3090 setup we ported MiniCPM-o 4.5 to vLLM-omni, hooked it up to the Naia client, and verified real-time English conversation + audio reference (voice cloning) end-to-end.
For context, we are the first to put MiniCPM-o 4.5 onto vLLM-omni, and a few people abroad have actually been asking how far we've got. There's another attempt by someone else at upstream's #1182, but it was never merged, and we've been on a separate track. It runs at a reasonable level now, and we're at the stage of cleaning it up to a level upstream maintainers would actually accept.
Why target MiniCPM-o 4.5?
I covered this in the previous post but it's worth restating: MiniCPM-o 4.5 is effectively the only omni model an individual can run on a single RTX 3090 that supports both audio-reference voice cloning and fine-tuning at the same time.
- GPT-4o / Gemini Live — cloud-only, weights closed, no fine-tuning
- Moshi — open source with excellent full-duplex, but English/French centric and the community has gone quiet
- Qwen3-Omni-30B — at 30B it doesn't fit on a single 24 GB card without quantization, and quantization breaks the audio output
- MiniCPM-o 4.5 — 9B (Whisper-medium + Qwen3-8B + CosyVoice2 + SigLip2 stitched together), runs on a single 24 GB card, voice cloning via audio reference, and fine-tunable
The Korean-language gap is a real downside and the place we can contribute. For this round of the AI Champion competition we submitted under the team name "Remember You", proposing a vLLM-omni Korean fine-tuning effort plus a VTuber service built on Naia Memory. We couldn't apply to the Domestic category at Dokpamo because no domestic model exists yet at the omni level we wanted.
"Wouldn't everyone want an AI VTuber on their own PC, that remembers them and speaks in their own voice?" To make that real, we think the following four pieces have to come together — and these are exactly what we're focused on:
- A locally runnable omni model — speech-to-speech end-to-end, not an STT/LLM/TTS pipeline. Pipeline approaches accumulate too much latency and lose prosody. → MiniCPM-o 4.5
- Audio reference (voice cloning) — "speak in this voice" — play one sample once and the model converses in that timbre. → CosyVoice2 spk_emb-based cloning
- Long-term memory — a memory system that doesn't reset every session, that actually remembers the user. → our own Naia Memory (4-tier neuroscience-inspired memory system)
- Korean — for daily use, all three of the above need to work in Korean. → CosyVoice2 Korean fine-tuning (AIHub data already secured, ready to start once a GPU is available)
There's another track coming soon as well: naia-sing — you can probably guess what it is from the name. More on that shortly. For now, what follows are the actual running demos and some technical detail.
Demo 1 — Conversation on the Naia Client
Please excuse my imperfect English in the demo video.
This is a recording of a conversation test after porting MiniCPM-o 4.5 to vLLM-omni and connecting it to the Naia desktop app. Hardware: RTX 3090 ×2 (2-way), bf16 with no quantization. As you'd expect from an omni model (S2S, end-to-end speech-to-speech), the conversation flows smoothly and natural emotional expression is preserved. English and Chinese are supported today.
Demo 2 — MiniCPM-o demo page + audio reference
This is a fork of the official MiniCPM-o 4.5 demo page wired to a vLLM-omni backend. It includes an audio reference (voice cloning) example — upload a WAV file and the response is synthesized in that voice's timbre and intonation. On the server side, an SHA-256 keyed LRU cache prevents re-computation on the same reference, so after the first response there's almost no extra overhead.
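For the curious, a minimal Python sketch of that caching idea, assuming a cache keyed by the SHA-256 of the raw reference WAV bytes with LRU eviction (the class and method names here are hypothetical, not the actual server code):

```python
import hashlib
from collections import OrderedDict

class RefAudioCache:
    """Hypothetical sketch: cache the expensive per-reference computation
    (e.g. speaker-embedding extraction) keyed by SHA-256 of the WAV bytes."""

    def __init__(self, max_entries=32):
        self.max_entries = max_entries
        self._cache = OrderedDict()

    def get_or_compute(self, wav_bytes, compute):
        key = hashlib.sha256(wav_bytes).hexdigest()
        if key in self._cache:
            self._cache.move_to_end(key)      # refresh LRU position
            return self._cache[key]
        value = compute(wav_bytes)            # only runs on a cache miss
        self._cache[key] = value
        if len(self._cache) > self.max_entries:
            self._cache.popitem(last=False)   # evict least-recently-used
        return value
```

Two requests with the same reference WAV map to the same key, so the heavy computation runs once; that's why only the first response pays the extra overhead.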
Implementation Summary
Work spread across three repositories.
| Repo | Role |
|---|---|
| nextain/vllm-omni | vLLM-omni fork — MiniCPM-o 4.5 model module + Thinker→Talker→Code2Wav 3-stage pipeline + voice cloning (`session.update.ref_audio`) |
| nextain/MiniCPM-o-Demo-forvLLM-omni | Fork of the official PyTorch demo — adds a `backend=vllm_omni` mode and lets you set/verify the vLLM-omni URL right from the audio-duplex page |
| nextain/naia-os | Naia desktop app — adds a first-class `MiniCpmOConfig.refAudio` field, plus an in-browser (Tauri webview) AudioContext → 16 kHz mono → base64 WAV encoder (sketched below) |
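The real encoder in naia-os is TypeScript running in the Tauri webview, but the framing itself is simple enough to sketch in Python, assuming the input is already 16 kHz mono PCM16 samples (the AudioContext resampling step is omitted here):

```python
import base64
import io
import wave

def encode_ref_wav(pcm16_mono_16k: bytes) -> str:
    """Wrap raw 16 kHz mono PCM16 samples in a WAV header,
    then base64-encode the result for transport."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)        # mono
        w.setsampwidth(2)        # 16-bit samples
        w.setframerate(16_000)   # 16 kHz
        w.writeframes(pcm16_mono_16k)
    return base64.b64encode(buf.getvalue()).decode("ascii")
```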
The Naia client connects directly to vLLM-omni's /v1/realtime (OpenAI Realtime API compatible) — bypassing the demo gateway entirely. PCM16 16 kHz audio goes out over WebSocket and 24 kHz mono PCM16 comes back. Everything was driven through Claude Code: model code, stage-config YAMLs, stage_input_processor, the demo backend proxy, and the Naia TypeScript client + WAV encoder + vitest.
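To make the wire format concrete, here is a minimal Python client sketch against that endpoint. The event names follow the OpenAI Realtime API (`input_audio_buffer.append`, `response.audio.delta`, and so on); the exact `session.update` payload for `ref_audio` is an assumption about our fork, and the URL is a placeholder:

```python
import asyncio
import base64
import json
import websockets  # pip install websockets

REALTIME_URL = "ws://localhost:8000/v1/realtime"  # placeholder endpoint

async def talk(pcm16_16k: bytes, ref_wav: bytes) -> bytes:
    async with websockets.connect(REALTIME_URL) as ws:
        # Voice cloning: send the reference audio once via session.update
        # (mirrors the session.update.ref_audio field mentioned above).
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {"ref_audio": base64.b64encode(ref_wav).decode()},
        }))
        # Stream the user's turn as base64 PCM16 @ 16 kHz, then commit it.
        await ws.send(json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(pcm16_16k).decode(),
        }))
        await ws.send(json.dumps({"type": "input_audio_buffer.commit"}))
        await ws.send(json.dumps({"type": "response.create"}))
        # Collect the reply: 24 kHz mono PCM16 arrives in audio delta events.
        out = bytearray()
        async for raw in ws:
            event = json.loads(raw)
            if event.get("type") == "response.audio.delta":
                out.extend(base64.b64decode(event["delta"]))
            elif event.get("type") == "response.done":
                break
        return bytes(out)

# usage sketch: reply_pcm = asyncio.run(talk(mic_bytes, ref_bytes))
```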
- Naia is mid-restructure. Parts of Naia-OS are moving toward a flexible CLI-backed `naia-agent`, which will combine with Naia-Memory and Naia-ADK.
But AI Slop Is Still There
In the previous post I wrote that "the goal of this work is to make an AI-native open-source ecosystem real — to show AI can contribute to open source without AI slop". That's still the hardest part, and a final review pass turned up new problems. I wanted to share this quickly, so I'm publishing the post before fixing them.
Over the past few days I ran another AI agent as an adversarial reviewer four times. Round 1: 2 BLOCKER + 13 MAJOR. One of those was a flat-out lie sitting in the code: "the example client doesn't work — the backend is audio-driven but the example was written text-driven". Round 2: 1 MAJOR (the example used the stdlib audioop module, removed in Python 3.13 under PEP 594). Rounds 3 and 4: clean. Our internal bar (two consecutive clean passes) was met. Then I ran one more pass — this time from a vLLM-project maintainer's perspective — and it still has issues:
- 3 BLOCKER:
  - To enable a single model's feature, the cross-cutting invariant in `chunk_transfer_adapter.py` was flipped (whitelist → blacklist)
  - `prompt_len_override` is MiniCPM-o-specific yet sits in shared infra
  - The `chat_template_kwargs.ref_audio` sidechannel is a layering violation — there's already a first-class field precedent (`speaker`, `language`) right next to it
- 5 MAJOR + 8 MINOR
Internal rules say it's good; external maintainer standards say it won't merge in a single round. On top of that, upstream/main has moved further past our fork's merge-base, so we're adding the merge work too.
Next Steps
- Fix the upstream-perspective BLOCKER/MAJOR — in progress
- Merge upstream/main + resolve conflicts — coming up
- Submit the PR — after the two above. Single PR vs split into two (model module / voice clone) is still TBD.
- Korean fine-tuning — at the model level. CosyVoice2 (the Code2Wav backbone) wasn't trained on Korean, which is the biggest wall. Korean text generation works fine; the audio synthesis is what comes out garbled.
I'll keep sharing as the work continues. The point of this exercise is to show that AI can contribute meaningfully to open source. Comments and GitHub Issues are very welcome.
Repos
Naia: https://naia.nextain.io