Naia
· Luke

Gemini 3.1 Flash Live Support + MiniCPM-o 4.5 vLLM Challenge — Naia OS's S2S Real-time Voice AI Development Journey

voice-ai, naia-os, gemini-live, s2s, omni-model, minicpm-o, vllm, stt, tts, llm, architecture, lessons-learned

Gemini 3.1 Flash Live Launch — And What Nextain Was Preparing

Today, Gemini 3.1 Flash Live was launched. The timing was serendipitous, so we decided to reveal the technology we've been working on, albeit in an incomplete state. Naia OS already supported Gemini Live through Naia accounts and Google providers, and we immediately applied Gemini 3.1 Flash Live. You can see it in action in the video below.

These models are known as S2S (Speech-to-Speech) or Omni models, and they support faster, more natural conversation than the traditional STT → LLM → TTS pipeline.

Real-time Voice Conversation Models (S2S, Omni Models) That Learn to Speak the Way Humans Do

GPT-4o voice mode, Gemini Live, and MiniCPM-o are prime examples of models that converse not through text but through real-time voice, even conveying emotion. In English they are typically referred to as Speech-to-Speech (S2S) or Omni models. I believed these models resemble humans much more closely than the existing one-way STT (Speech-to-Text) and TTS (Text-to-Speech) models do. Consider that the difficulty hearing-impaired individuals face in learning to speak stems from being unable to hear their own utterances: speaking and listening are learned together, and S2S models likewise process audio in and audio out end to end.
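Why the cascaded pipeline feels slower can be sketched with back-of-the-envelope arithmetic. The numbers below are purely illustrative assumptions, not measurements of any of the models mentioned:

```python
# Illustrative (hypothetical numbers): why a cascaded STT -> LLM -> TTS
# pipeline responds more slowly than an end-to-end S2S model.
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    latency_ms: int  # assumed time before the stage emits its first output

def first_audio_latency(stages: list[Stage]) -> int:
    # In a cascade, each stage waits for the previous stage's output,
    # so time-to-first-audio is the sum of per-stage latencies.
    return sum(s.latency_ms for s in stages)

cascade = [Stage("STT", 300), Stage("LLM", 500), Stage("TTS", 200)]
print(first_audio_latency(cascade))  # 1000

# An S2S model streams audio out while still listening, so its
# time-to-first-audio is a single end-to-end figure (assumed here).
s2s_latency_ms = 400
print(s2s_latency_ms)  # 400
```

The cascade also loses paralinguistic information (tone, emotion) at the STT boundary, which is why S2S responses can react to *how* something was said, not just the transcript.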

Therefore, I concluded that these models are the future and the coming mainstream of voice conversation. I removed the individual STT/TTS configuration options and consolidated them into a 'Live Conversation Model' setting, adding the Gemini Live API and gpt-4o-realtime-preview. However, since both APIs are paid, for free users I added an Edge TTS option labeled 'TTS Only' to the conversation model selection. This work shipped in version 0.1.2, released yesterday. I integrated the Gemini Live API with Naia accounts and ran real-world tests, and the conversational responses, in both emotional feedback and reaction speed, were highly satisfying.
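The tiered selection described above can be sketched as a small dispatch function. The function name, tier strings, and backend identifiers here are illustrative assumptions, not Naia OS's actual API:

```python
# Hypothetical sketch of picking a live-conversation backend by account
# tier and provider; all names are illustrative, not Naia OS internals.
def choose_backend(tier: str, provider: str) -> str:
    """Return a live-conversation backend id for the given account."""
    if tier == "free":
        # Free users fall back to the text pipeline with Edge TTS
        # (shown as 'TTS Only' in the model selection).
        return "edge-tts"
    if provider == "google":
        return "gemini-live"              # Gemini Live API (paid)
    if provider == "openai":
        return "gpt-4o-realtime-preview"  # OpenAI realtime API (paid)
    raise ValueError(f"unknown provider: {provider}")

print(choose_backend("free", "google"))  # edge-tts
print(choose_backend("paid", "google"))  # gemini-live
```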

My Misconception: Mistaking Real-time Voice Conversation Models for Simple STT/TTS Models

However, there was a significant misunderstanding here. I had understood the Gemini Live API to be a next-generation integrated STT/TTS model, but I later discovered that it has a specific LLM, Gemini 2.5 Flash, embedded within it, and that LLM cannot be swapped out.

MiniCPM-o-4.5: A Real-time Voice Conversation Model Usable Locally

I realized this misunderstanding while searching for a real-time voice conversation model that could run on an RTX 3090-class GPU, since Naia aims to support local models within the user's reach. The initial candidates were Moshi, Qwen3-Omni-30B, and MiniCPM-o-4.5. Moshi, though an open-source pioneer in this field, primarily supports English and French and lacks an active community, so I didn't choose it. Qwen3-Omni-30B, despite its recent excellent results, would not fit in a single RTX 3090's 24GB of VRAM and was unverified. In the end I chose MiniCPM-o. Although it doesn't currently support Korean, I judged that with some capital investment additional language fine-tuning would be possible, and since it supports reference audio, users could even implement the TTS voice they want.
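The VRAM screening above comes down to simple arithmetic: weights take roughly parameters × bytes-per-parameter, plus headroom for activations and the KV cache. A minimal sketch (the 1.2 overhead factor is an assumption, not a measured figure):

```python
# Back-of-the-envelope VRAM check used to rule candidate models in or out.
def fits_in_vram(params_b: float, bits: int, vram_gb: float,
                 overhead: float = 1.2) -> bool:
    """params_b: parameter count in billions; bits: per-weight precision."""
    weights_gb = params_b * bits / 8  # e.g. 8B params at 16-bit ~= 16 GB
    # Activations and KV cache need headroom on top of the weights;
    # the 1.2 factor is a rough assumption, not a measured value.
    return weights_gb * overhead <= vram_gb

print(fits_in_vram(30, 16, 24))  # 30B at fp16: ~60 GB of weights, no fit
print(fits_in_vram(8, 16, 24))   # 8B at fp16: ~16 GB of weights, fits
```

By this estimate a 30B model at fp16 needs about 60 GB for weights alone, far beyond a single RTX 3090, while an 8B-class model fits with room to spare.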

Furthermore, its small size suggested that on a 24GB card it could run alongside a quantized Qwen3-Omni-30B-A3B in the remaining VRAM, significantly boosting LLM performance even on an RTX 3090-class system. While Qwen3-Omni is itself a real-time voice conversation (Omni) model, as mentioned earlier, I didn't adopt it for that role because its audio output breaks when quantized, making it unusable on 24GB consumer GPUs. As a text LLM, however, its quantized version reportedly performs well.

So, I deployed MiniCPM-o on a rented GPU (a RunPod RTX 3090) and attached it to the Naia app by building a WebSocket bridge server. What I discovered in the process was that the model wasn't simply echoing back what it heard: the embedded Qwen3-8B model was generating the responses, and because an Omni model is trained end to end, its components cannot be swapped out. In that sense, MiniCPM-o would more accurately be called Qwen3-8B-omni. More precisely, it reportedly uses Whisper-medium for listening, Qwen3-8B as its 'brain,' CosyVoice2 for speech synthesis, and SigLIP2 for vision. These separate models are stitched together and retrained on multimodal data.

While this is broadly considered fine-tuning, the sheer scale of multimodal data involved means the cost can run into tens to hundreds of billions of Korean won. There was recently much debate in Korea's 'Dokpamo' community over whether certain models were built 'from scratch,' and this suggests the real challenge lies less in starting from scratch than in the scale of the multimodal retraining.
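For the bridge, the essential piece is how raw microphone audio gets framed into WebSocket messages. A minimal sketch of such framing, where the envelope fields (`"type"`, `"pcm16"`, `"sample_rate"`) are assumptions for illustration, not Naia OS's actual protocol:

```python
# Minimal sketch of message framing for a WebSocket bridge between a
# client app and a speech model server. Field names are hypothetical.
import base64
import json

def encode_audio_frame(pcm16: bytes, sample_rate: int = 16000) -> str:
    # Raw PCM bytes can't ride in a JSON text frame directly,
    # so base64-encode them into the envelope.
    return json.dumps({
        "type": "audio",
        "sample_rate": sample_rate,
        "pcm16": base64.b64encode(pcm16).decode("ascii"),
    })

def decode_audio_frame(message: str) -> bytes:
    frame = json.loads(message)
    if frame["type"] != "audio":
        raise ValueError("not an audio frame")
    return base64.b64decode(frame["pcm16"])

chunk = b"\x00\x01" * 160  # 10 ms of 16 kHz mono 16-bit audio (320 bytes)
assert decode_audio_frame(encode_audio_frame(chunk)) == chunk
```

An actual bridge would sit in an async loop (e.g. with the `websockets` library), relaying frames like these between the app and the model server in both directions.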

(→ MiniCPM-o Experiment Log)

MiniCPM-o-4.5: Taking On vLLM-omni Support with Claude Code

However, since the embedded Qwen3-8B alone is said to be on par with GPT-4o, I couldn't just pass it over. Moreover, with its voice-reference capability and the potential for LLM fine-tuning, I judged it a model that Nextain, which pursues sovereign AI, absolutely had to pursue. I forked vLLM and deployed it on RunPod, only to find it required 48GB of VRAM; I ran it for almost two days. Only then did I belatedly discover that there is a separate project called vLLM-omni, and that someone was already working on MiniCPM-o-4.5. So we commented, offering to support l3 testing. (→ vllm-omni #1182) There's been no response yet, and while waiting, we continued our own attempts based on vLLM-omni.

This time I approached it with particular caution. I had previously submitted a PR generated with Claude Code to another open-source project without fully validating the AI's analysis, and was roundly criticized. The code analysis was incomplete, I hadn't followed the community's rules, and I had even built something outside the project's scope, so the criticism was entirely deserved. So this time I focused on gathering context from the repository's upstream perspective and ran repeated adversarial reviews. Only after those passed cleanly did I confirm that it could actually hold a conversation when integrated with Naia OS, and then wrote and reviewed the documentation for an upstream contribution.

However, RunPod went down, and when I tried to bring the pod back up today, it failed again. The incomplete records from my earlier work came back to haunt me: without a way to verify the runtime, reproduction itself was impossible. I had to dig through all of my previous session logs to find what I needed, and as I reproduced the work and revised the records, I hit another critical issue and had to stop. Because the model support was new, I had used a one-off pattern instead of following the patterns of the other models in vLLM-omni, and when vLLM-omni updated, everything broke. The analysis produced another interim report on harness and context-control failure, and based on it I am now revising the context and harness again, running iterative code analysis instead of burning money on RunPod. (sigh)

https://github.com/nextain/vllm-omni/blob/main/.agents/docs/minicpm-o-midterm-review.md

I had hoped to release a demo video of MiniCPM-o-4.5 running today, to coincide with the launch of Gemini 3.1 Flash Live, but getting it right on the first try is proving difficult. It's now approaching 3:30 AM. Above all, the purpose of this work is an experiment in realizing an AI-native open-source ecosystem, one where AI can contribute to open source correctly, without 'AI slop.' I've arranged to borrow some graphics cards next week, so I expect things to get a bit easier.

I will also share updates on whether audio referencing or LLM fine-tuning becomes possible as further work progresses.



This post is licensed under CC BY-NC-SA 4.0.
