Hello, I'm Luke, building Naia.
Naia's voice model has started producing results, so now I'm looking at singing too. I want Alpha to sing for me. In particular, I'm interested in cover songs (foreign songs translated into Korean).
I started a month ago, hit a ceiling, and recently got slightly better results, so it feels like more digging will yield something. Previously the Korean lyrics and the pitch were both alien-level garbage; now it actually sings something like Korean.
Still sounds a bit drunk though. ^^ Still, since something came out and it's fun, I'm sharing the progress along with the report. Someday I hope it really sings for me.
Naia-Sing is not yet stabilized — this is closer to a progress share than a launch. The official release order is Naia-talk (Naia-Omni) first (already stabilized), and Naia-Sing follows.
TL;DR
- 23 songs evaluated, composite score 27.8 to 60.6 (0-100 scale)
- User-rated BEST 3 songs = metric rank 1·3·5, WORST 2 = rank 17·19 → framework valid confirmed
- Hard Gate (service-ready minimum) = 0/23 passed — current base is not service-ready
- Biggest weakness: A. Intelligibility (avg CER 1.18) + E. Expression (only 1/23 pass)
- Next step = apply Track A 247h Korean fine-tune + shorter inference + syllable-accurate lyrics
Top 10 Compilation Video
7 min 24 sec, rank #1 to #10 one by one. Each track is normalized with ffmpeg loudnorm (-14 LUFS) + dynaudnorm. The video makes the genre matching, timbre consistency, and pronunciation trade-offs audible.
| # | Song | Source | Score | Strength |
|---|---|---|---|---|
| 1 | banan_jaychou_translation | Jay Chou ballad (13s, ko translation) | 60.6 | CER 0.52 lowest overall · ko 0.96 |
| 2 | genre_digicharat | Digi Charat Party Night (1999) | 50.8 | ko 0.97 · timbre 0.92 |
| 3 | genre_gunslinger | Gunslinger Girl doll · Lia | 48.8 | timbre 0.93 · ballad match |
| 4 | genre_escaflowne | Escaflowne OP (1996) | 48.2 | f0 range 0.91 match |
| 5 | banan_adele_translation | Adele — Someone Like You (13s) | 47.8 | E.energy 0.72 highest |
| 6 | base_gunslinger | Same song, KO-S1 baseline | 47.8 | CER 0.70 (2nd lowest) |
| 7 | genre_macross | Macross — Do You Remember Love? (1984) | 47.7 | ko 0.97 · timbre 0.92 |
| 8 | genre_chobits | Let Me Be With You (2002) | 46.7 | timbre 0.93 |
| 9 | pipe_01_adele | Adele source 13s pipeline | 45.9 | f0_corr 0.81 highest |
| 10 | genre_nadia | Nadia: Secret of Blue Water (1990) | 44.2 | timbre 0.97 highest |
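The loudness normalization used for the compilation could be scripted roughly as below. This is a minimal sketch, not the actual command from the project: the post only names the `loudnorm` and `dynaudnorm` filters and the -14 LUFS target, so the file names, the `-y` overwrite flag, the helper name `build_normalize_cmd`, and leaving every other filter option at its default are assumptions.

```python
import shlex


def build_normalize_cmd(src: str, dst: str, target_lufs: float = -14.0) -> list[str]:
    """Build an ffmpeg command that applies loudnorm, then dynaudnorm.

    Only the integrated loudness target (I) is set; all other filter
    options stay at ffmpeg's defaults (an assumption of this sketch).
    """
    audio_filter = f"loudnorm=I={target_lufs:g},dynaudnorm"
    return ["ffmpeg", "-y", "-i", src, "-af", audio_filter, dst]


# Hypothetical file names, for illustration only.
cmd = build_normalize_cmd("rank01.wav", "rank01_norm.wav")
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would run it; building the list separately keeps the filter string easy to inspect and test.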
1. "Singing Korean well" — 5 independent dimensions
Singing Korean well cannot be reduced to one metric. Academic work (TCSinger, Vevo, DiffSinger) also reports 4-5 dimensions together.
| Dim | Academic metric | Our measurement | Meaning |
|---|---|---|---|
| A. Intelligibility | CER, PER | Whisper-small KO STT vs intended lyrics edit distance | "Are the lyrics audible" |
| B. Pitch | F0 RMSE, F0 corr | librosa.pyin F0 pearson + range ratio | "Hitting the notes" |
| C. Timbre similarity | SECS (ECAPA cosine) | MFCC mean cosine (proxy) | "Same singer" |
| D. Naturalness | MOS-N, UTMOS | chunk_disc per min (proxy) | "Not machine-like" |
| E. Expression (cover-only) | (own definition) | source RMS + spectral centroid + vibrato corr | "Follows original's expression" |
Dimension E is core to cover songs. In general SVS, the musical score is the expression ground truth. In a cover, the source vocal's dynamics, vibrato, and articulation are the ground truth.
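Dimension A rests on character error rate (CER): the edit distance between the intended lyrics and the STT transcript, divided by the length of the intended lyrics. A minimal sketch follows, assuming whitespace is stripped before comparison (the post does not say how the real `svs_eval.py` normalizes text). It also shows why CER can exceed 1.0: a transcript with many extra characters needs more edits than the reference has characters.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings (character level)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate of hyp against ref, ignoring whitespace."""
    ref = "".join(ref.split())
    hyp = "".join(hyp.split())
    if not ref:
        raise ValueError("reference lyrics must not be empty")
    return edit_distance(ref, hyp) / len(ref)


print(cer("사랑해", "사랑해"))  # identical transcript
print(cer("가", "나다라"))      # 1 substitution + 2 insertions over 1 reference character
```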
Hard Gate (service-ready minimum)
A. CER ≤ 0.30 AND ko_prob ≥ 0.90 [Korean pronunciation]
B. f0_corr ≥ 0.50 AND range 0.6~1.4 [Pitch]
C. timbre_sim ≥ 0.65 [Timbre]
D. chunk_disc ≤ 2/min [Smoothness]
E. energy_corr ≥ 0.5 · brightness ≥ 0.4 · vibrato 0.5~1.5 [Expression]
Academic basis: CER 0.30 = production STT threshold; SECS 0.60-0.70 = zero-shot SVC passing standard.
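The gate above can be written down as a small checker. This is a sketch under assumptions: the thresholds are copied from the list above, but the `SongMetrics` field names and the function names are hypothetical and not taken from the project code.

```python
from dataclasses import dataclass


@dataclass
class SongMetrics:
    # Field names are illustrative; the real evaluation script may differ.
    cer: float
    ko_prob: float
    f0_corr: float
    f0_range_ratio: float
    timbre_sim: float
    chunk_disc_per_min: float
    energy_corr: float
    brightness_corr: float
    vibrato_ratio: float


def hard_gate(m: SongMetrics) -> dict[str, bool]:
    """Return pass/fail per dimension using the Hard Gate thresholds."""
    return {
        "A": m.cer <= 0.30 and m.ko_prob >= 0.90,
        "B": m.f0_corr >= 0.50 and 0.6 <= m.f0_range_ratio <= 1.4,
        "C": m.timbre_sim >= 0.65,
        "D": m.chunk_disc_per_min <= 2.0,
        "E": (m.energy_corr >= 0.5
              and m.brightness_corr >= 0.4
              and 0.5 <= m.vibrato_ratio <= 1.5),
    }


def passes_hard_gate(m: SongMetrics) -> bool:
    """A song is service-ready only if all five dimensions pass."""
    return all(hard_gate(m).values())


# cer and ko_prob are song #1's values from the Top 10 table;
# every other number is a made-up placeholder for illustration.
example = SongMetrics(cer=0.52, ko_prob=0.96, f0_corr=0.6, f0_range_ratio=1.0,
                      timbre_sim=0.9, chunk_disc_per_min=0.0,
                      energy_corr=0.6, brightness_corr=0.5, vibrato_ratio=1.0)
print(hard_gate(example), passes_hard_gate(example))
```

Because the gate is an AND across dimensions, a single failing dimension (here A, through CER) is enough to fail the song.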
2. Framework Self-Validation
Lesson — Whisper-small was invalid for measuring SVS quality (svs_eval.py §43, 5/17): even golden in-distribution samples produced empty output. Singing only triggered speech-likeness.
→ Self-validation is mandatory:
User best vs user worst
↓ ↓
Must be separable by the metric
↓
Otherwise the metric itself is invalid
23-song self-validation result:
| User rating | Song | Metric ranking | Separated |
|---|---|---|---|
| 🟢 BEST | banan_jaychou_translation | 1/23 | ✓ |
| 🟢 BEST | genre_gunslinger | 3/23 | ✓ |
| 🟢 BEST | banan_adele_translation | 5/23 | ✓ |
| 🔴 WORST | base_macross | 17/23 | ✓ |
| 🔴 WORST | base_flcl | 19/23 | ✓ |
→ Framework valid (ranking proxy confirmed). We now have a tool to objectively compare Track A training results.
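One way to make "separable" concrete is to require that every user-rated BEST song ranks above every user-rated WORST song. The sketch below uses the ranks from the table above; treating strict rank separation as the criterion is an assumption, and `is_separable` is a hypothetical helper name.

```python
def is_separable(best_ranks: list[int], worst_ranks: list[int]) -> bool:
    """True if the worst-ranked BEST song still outranks the best-ranked WORST song."""
    return max(best_ranks) < min(worst_ranks)


def separation_margin(best_ranks: list[int], worst_ranks: list[int]) -> int:
    """Rank gap between the two groups; larger means cleaner separation."""
    return min(worst_ranks) - max(best_ranks)


best = [1, 3, 5]   # metric ranks of the user-rated BEST songs
worst = [17, 19]   # metric ranks of the user-rated WORST songs
print(is_separable(best, worst), separation_margin(best, worst))
```

With only five labeled songs this is a sanity check on the ranking, not a statistical validation.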
3. Hard Gate Pass — 0/23 ❌
| Gate criterion | What it checks | Pass |
|---|---|---|
| A. CER ≤ 0.30 | Production STT threshold | 0/23 ❌ |
| A. ko_prob ≥ 0.90 | Whisper Korean detection | 14/23 |
| B. f0_corr ≥ 0.5 + range OK | Pitch preservation | 2/23 ❌ |
| C. timbre_sim ≥ 0.65 | Ref consistency | 10/23 |
| D. disc ≤ 2/min | Chunk artifact | 20/23 ✓ |
| E. (3-metric AND) | Cover expression | 1/23 ❌ |
Current Vevo1.5 Korean base = 0% service-ready. A·B·E are the critical gaps.
4. Diagnosis — Per-dimension weakness and cause
A. Intelligibility — Vevo Korean phoneme synthesis is weak
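Vevo1.5 was pre-trained on Sing-0.4k (438h, including 3.8h of Korean CSD), yet its Korean phoneme mapping is poor: CER avg 1.18 = 70% broken. A CER above 1.0 means the transcript needs more character edits than the intended lyrics have characters. ko_prob is only 0.86, so the output "sounds like Korean" acoustically but does not resolve into the right phonemes.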
B. Pitch — melody_control under-trained on melody
F0_corr avg 0.18. Several songs are negative (gunslinger -0.20, gsteatrino -0.46), i.e. anti-correlated with the source melody. vevosing_melody_control prioritizes phoneme content transfer; F0 contour is secondary.
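To illustrate what a negative f0_corr means, here is a pure-Python sketch of Pearson correlation over frames that are voiced in both contours. It assumes unvoiced frames are marked as NaN (as `librosa.pyin` does in its F0 output) and that the two contours are already time-aligned; the range-ratio part of dimension B is not covered, and the function name is hypothetical.

```python
import math


def f0_corr(src_f0: list[float], out_f0: list[float]) -> float:
    """Pearson correlation of two F0 contours over mutually voiced frames."""
    pairs = [(s, o) for s, o in zip(src_f0, out_f0)
             if not (math.isnan(s) or math.isnan(o))]
    if len(pairs) < 2:
        raise ValueError("need at least two mutually voiced frames")
    n = len(pairs)
    mean_s = sum(s for s, _ in pairs) / n
    mean_o = sum(o for _, o in pairs) / n
    cov = sum((s - mean_s) * (o - mean_o) for s, o in pairs)
    var_s = sum((s - mean_s) ** 2 for s, _ in pairs)
    var_o = sum((o - mean_o) ** 2 for _, o in pairs)
    if var_s == 0 or var_o == 0:
        raise ValueError("correlation is undefined for a constant contour")
    return cov / math.sqrt(var_s * var_o)


nan = float("nan")
source = [220.0, 247.0, 262.0, nan, 294.0]    # rising melody, one unvoiced frame
octave_down = [110.0, 123.5, 131.0, 140.0, 147.0]
mirrored = [147.0, 131.0, 123.5, nan, 110.0]  # falls where the source rises
print(f0_corr(source, octave_down), f0_corr(source, mirrored))
```

A contour sung an octave lower still correlates almost perfectly, which is why correlation is paired with a separate range check; a contour that moves opposite to the source gives a negative value, as with the two songs named above.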
C. Timbre — ref matching works
Songs with AI-Hub 8 refs avg timbre_sim 0.91. But timbre drift at chunk boundaries = user feedback "different voices keep entering".
D. Smoothness — chunk merge OK
20/23 pass. 100ms crossfade effective. 3 outliers only when source vocal has heavy silence.
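For readers unfamiliar with chunk merging, a 100ms crossfade can be sketched as below. The post does not say which fade curve the pipeline uses, so the linear ramp, the plain-list audio representation, and the function name are all assumptions.

```python
def crossfade(a: list[float], b: list[float], sample_rate: int,
              fade_ms: float = 100.0) -> list[float]:
    """Join two mono chunks, blending the end of a into the start of b."""
    n = min(int(sample_rate * fade_ms / 1000), len(a), len(b))
    if n == 0:
        return a + b
    mixed = []
    for i in range(n):
        w = (i + 1) / (n + 1)  # weight of the incoming chunk, strictly between 0 and 1
        mixed.append(a[len(a) - n + i] * (1 - w) + b[i] * w)
    # The overlap region is shared, so the result is n samples shorter than a + b.
    return a[:-n] + mixed + b[n:]


# Tiny illustration at a toy sample rate of 50 Hz, so 100 ms = 5 samples.
out = crossfade([1.0] * 10, [0.0] * 10, sample_rate=50)
print(len(out), out[5:10])
```

An equal-power (sine/cosine) ramp is a common alternative when the two chunks are uncorrelated; the linear ramp is used here only because it is the simplest to read.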
E. Expression — biggest failure
1/23 pass. Avg energy_corr 0.32, brightness_corr 0.19. Vevo prosody tokenizer is Korean-undertrained + dynamics lost during content-style separation.
5. What should we do — prioritized improvements
🟢 P0 (immediate, verified effect)
| Item | Effect | Note |
|---|---|---|
| Apply Track A training results | A·B·E simultaneously | 247h KO fine-tune in progress |
| Single-chunk inference | C·D | 60s song → 15s chorus → matches #1 sweet spot |
| Syllable-accurate lyrics | A | SimpleAligner equal-split → match source syllable count |
Recipe of #1 (banan_jaychou_translation): short (13s) + syllable-matched + clean studio source
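A quick pre-check for the "syllable-accurate lyrics" item is to count Hangul syllable blocks in the translated line and compare against the source syllable count. This sketch assumes one precomposed Hangul block (U+AC00 to U+D7A3) equals one sung syllable and that the source count is supplied by hand; both helper names are hypothetical and unrelated to SimpleAligner's actual API.

```python
def count_hangul_syllables(text: str) -> int:
    """Count precomposed Hangul syllable blocks (U+AC00..U+D7A3)."""
    return sum(1 for ch in text if "\uac00" <= ch <= "\ud7a3")


def syllable_mismatch(korean_lyrics: str, source_syllables: int) -> int:
    """Positive: translation has too many syllables; negative: too few."""
    return count_hangul_syllables(korean_lyrics) - source_syllables


print(count_hangul_syllables("사랑해 너를"))  # spaces and punctuation are ignored
print(syllable_mismatch("사랑합니다", 3))     # five syllables against a 3-syllable source phrase
```

A non-zero mismatch flags lines where an equal split would stretch or squeeze syllables against the source melody.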
🟡 P1 (within a week)
| Item | Effect |
|---|---|
| Introduce ECAPA-TDNN SECS | C accuracy (replace MFCC proxy) |
| UTMOS / Sing-MOS predictor | D naturalness objectification |
| Phrase-aware aligner | A·D (no more equal-syllable split) |
| Suno API → Korean source | A·E (Korean vocal source pool) |
🟠 P2 (mid-term, verification needed)
| Item | Effect | Risk |
|---|---|---|
| Vevo1.5 fine-tune (Track A) | A·B·E all | Result unverified |
| Re-examine TCSinger2 pivot | Expression boost possible | Stack complex, ceiling risk |
| F5-TTS / CosyVoice2 SVS adapter | A intelligibility | Adapter missing |
| SVC post-processing (RVC/SoulX) | C timbre | End-to-end CER must be measured |
🔴 P3 (long-term, paradigm shift)
- Own SVS model training → verified external base is more advantageous (Naia core philosophy)
- Subjective MOS listening test → absolute ground truth
- Suno + own cover pipeline productization → "AI singing service" Track D
6. Next Decision Points
- When Track A finishes → re-evaluate with this framework, measure improvement delta
- Improvement < 20% → drop Vevo1.5 base, reconsider TCSinger2 or F5-TTS
- Improvement ≥ 30% → full fine-tune + Track D service entry
- User gate = listen to Track A results + check all 5 dimensions, then decide
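The decision points above can be expressed as a small rule. This is a sketch under assumptions: the post does not define how "improvement" is computed, so relative change in composite score is used here, and the 20% to 30% band is reported as unspecified because the post gives no rule for it. The final call still belongs to the user gate (listening plus the 5-dimension check).

```python
def relative_improvement(before: float, after: float) -> float:
    """Relative change in composite score (an assumed definition)."""
    if before <= 0:
        raise ValueError("baseline score must be positive")
    return (after - before) / before


def next_step(improvement: float) -> str:
    """Map an improvement ratio (0.25 = 25%) to the post's decision points."""
    if improvement < 0.20:
        return "drop Vevo1.5 base; reconsider TCSinger2 or F5-TTS"
    if improvement >= 0.30:
        return "full fine-tune + Track D service entry"
    return "unspecified in the post (20% to 30% band)"


# Hypothetical before/after composite scores, for illustration only.
delta = relative_improvement(before=40.0, after=50.0)
print(f"{delta:.0%}", "->", next_step(delta))
```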
7. naia-sing 32-day research log
Based on git history — no hallucinations. Real start = 2026-04-25 (one month of history).
2026-04-25 ─ Start: project setup + RVC pipeline scripts.
2026-04-26 ─ Source data + cross-lingual cover pipeline + IP groundwork.
2026-04-28 ─ 3-frontier lyrics + Korean SVS verification + phone-call ref voice pipeline.
2026-05-15 ─ DSKR (DiffSinger Korean) + AI-Hub 465 Korean SVS training stack confirmed.
2026-05-15 ─ IO bottleneck root-fix — pickle 13x lighter + HDD→SSD (avoid 58s/step).
2026-05-16 ─ Working stack confirmed — DSKR+CSD transfer → RVC swap. User verified OK.
2026-05-16 ─ Vocoder ceiling refuted + extend training 300k→500k.
2026-05-16 ─ Extended training overfit refuted — best=S_300000 locked, step axis closed.
2026-05-18 ─ BigVGAN wired as DSKR default vocoder + aihubshell key security patch.
2026-05-19 ─ DSKR S_300000 ceiling confirmed as real (MCD37 vs 48 cross-song).
2026-05-19 ─ TCSinger 2 Korean SVS started → PAUSED @ VAE epoch 395.
2026-05-21 ─ R3 adversarial review consolidation + naia-sing resource gate hold.
2026-05-26 ─ TCSinger2 closed → Vevo1.5 Korean SVS discovered.
2026-05-26 ─ Korean singing dataset 247h complete (GTSinger + AI-Hub 465).
2026-05-26 ─ BASE STATE SNAPSHOT — baseline lock before fine-tune.
2026-05-26 ─ 3-AI cross-review (codex+gemini+opencode) + Reality Check.
2026-05-26 ─ Track A Phase 1 AR smoke PASS + Phase 2 autonomous wrapper.
2026-05-26 ─ Per-stage VRAM measurement + 5 cover batch + 10 anime OST 60s.
2026-05-26 ─ Critical finding: Vevo melody_control = source vocal phoneme dependent.
2026-05-26 ─ AI-Hub 8 short refs genre-matched batch + ffmpeg loudnorm post-processing.
2026-05-27 ─ This report: 5-dim evaluation framework + 23-song baseline + Hard Gate.
2026-05-27 ─ Framework valid confirmed + Top 10 compilation video.
The 32 days in summary:
- 5 different SVS stacks tried (RVC → DSKR → SoulX → TCSinger2 → Vevo1.5)
- 1 ceiling reached (DSKR S_300000) + 1 paused (TCSinger2 VAE) + 1 dropped (RVC pure)
- Current = Vevo1.5 external base + Track A 247h KO fine-tune in progress
→ No own model training. Differentiated by verified external base + our stack (memory / privacy / RAG / local serving). Naia core philosophy, paradigm locked 2026-05-15.
Honest Limits
This is not a report that the base is good. Rather it's a report that objectively measures the base as 0% service-ready. We now have a metric framework whose ranking aligns with human listening — a tool to honestly check whether training actually works. That is the meaning of this post.
When Track A finishes, we will re-evaluate the same 23 songs and report the improvement delta quantitatively.
Source markdown (SoT) — .agents/work/benchmark-2026-05-27/BENCHMARK_REPORT.md
Evaluation framework SoT — same path, README.md
Top 10 compilation video — top10_mp4/naia_sing_top10.mp4 (7m24s, 18MB)
