Naia
· Luke· 1

Naia Sings — Korean SVS Benchmark First Baseline (Vevo1.5 base, 23 songs evaluated)

naia-sing · svs · korean-tts · vevo · benchmark · ai-singing

Hello, I'm Luke, building Naia.

Naia's voice model has started to deliver results, so I'm now looking at singing too. I want Alpha to sing for me. I'm especially interested in cover songs (foreign songs translated into Korean).

I started a month ago, hit a ceiling, and recently got slightly better results; it feels like more digging will pay off. Previously Korean lyrics and pitch were both alien-level garbage; now it actually sings something like Korean.

It still sounds a bit drunk, though. ^^ But something came out and it's fun, so I'm sharing the progress along with the report. Someday I hope it really sings for me.

Naia-Sing is not yet stabilized; this is closer to a progress share than a launch. The official release order is Naia-talk (Naia-Omni) first (already stabilized), with Naia-Sing to follow.

Alpha singing

TL;DR

  • 23 songs evaluated, composite score 27.8 to 60.6 (0 to 100 scale)
  • User-rated BEST 3 songs = metric rank 1·3·5, WORST 2 = rank 17·19 → framework confirmed valid
  • Hard Gate (service-ready minimum) = 0/23 passed — current base is not service-ready
  • Biggest weakness: A. Intelligibility (avg CER 1.18) + E. Expression (only 1/23)
  • Next step = apply Track A 247h Korean fine-tune + shorter inference + syllable-accurate lyrics

Top 10 Compilation Video

7 min 24 sec, playing ranks #1 to #10 one by one. Each track is normalized with ffmpeg loudnorm (-14 LUFS) + dynaudnorm. The video makes the genre matching, timbre consistency, and pronunciation trade-offs audible.
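The normalization step can be sketched as a small command builder. This is a minimal sketch, not the exact command used for the video: the file names are hypothetical, and a single-pass `loudnorm` with only the integrated-loudness target set is an assumption.

```python
import shlex


def build_normalize_cmd(src: str, dst: str, target_lufs: float = -14.0) -> list[str]:
    """Build an ffmpeg argv that chains loudnorm and dynaudnorm.

    Assumption: single-pass loudnorm with only the integrated loudness (I) set;
    all other filter options are left at ffmpeg's defaults.
    """
    audio_filter = f"loudnorm=I={target_lufs:g},dynaudnorm"
    # -y overwrites dst if it already exists
    return ["ffmpeg", "-y", "-i", src, "-af", audio_filter, dst]


# Hypothetical file names, for illustration only
cmd = build_normalize_cmd("rank01.wav", "rank01_norm.wav")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine with ffmpeg installed.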

| # | Song | Source | Score | Strength |
|---|---|---|---|---|
| 1 | banan_jaychou_translation | Jay Chou ballad (13s, ko translation) | 60.6 | CER 0.52 lowest overall · ko 0.96 |
| 2 | genre_digicharat | Digi Charat Party Night (1999) | 50.8 | ko 0.97 · timbre 0.92 |
| 3 | genre_gunslinger | Gunslinger Girl doll · Lia | 48.8 | timbre 0.93 · ballad match |
| 4 | genre_escaflowne | Escaflowne OP (1996) | 48.2 | f0 range 0.91 match |
| 5 | banan_adele_translation | Adele - Someone Like You (13s) | 47.8 | E.energy 0.72 highest |
| 6 | base_gunslinger | Same song, KO-S1 baseline | 47.8 | CER 0.70 (2nd lowest) |
| 7 | genre_macross | Macross - Do You Remember Love? (1984) | 47.7 | ko 0.97 · timbre 0.92 |
| 8 | genre_chobits | Let Me Be With You (2002) | 46.7 | timbre 0.93 |
| 9 | pipe_01_adele | Adele source 13s pipeline | 45.9 | f0_corr 0.81 highest |
| 10 | genre_nadia | Nadia: Secret of Blue Water (1990) | 44.2 | timbre 0.97 highest |

1. "Singing Korean well" — 5 independent dimensions

Singing quality cannot be reduced to one metric. Academic work (TCSinger, Vevo, DiffSinger) also reports 4-5 dimensions together.

| Dim | Academic metric | Our measurement | Meaning |
|---|---|---|---|
| A. Intelligibility | CER, PER | Whisper-small KO STT vs intended lyrics edit distance | "Are the lyrics audible" |
| B. Pitch | F0 RMSE, F0 corr | librosa.pyin F0 pearson + range ratio | "Hitting the notes" |
| C. Timbre similarity | SECS (ECAPA cosine) | MFCC mean cosine (proxy) | "Same singer" |
| D. Naturalness | MOS-N, UTMOS | chunk_disc per min (proxy) | "Not machine-like" |
| E. Expression (cover-only) | (own definition) | source RMS + spectral centroid + vibrato corr | "Follows original's expression" |

Dimension E is core to cover songs. In general SVS, the musical score is the ground truth for expression; in a cover, the ground truth is the source vocal's dynamics, vibrato, and articulation.
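For dimension A, the character error rate is an edit distance divided by the reference length. Here is a minimal sketch, assuming spaces are stripped before comparison (the post does not state its text normalization); the function names are hypothetical and the STT step is left out. Because insertions count as errors, CER can exceed 1.0 when the transcript is longer than the intended lyrics.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance (substitutions, insertions, deletions) over characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(intended_lyrics: str, transcript: str) -> float:
    """Character error rate; whitespace removal is an assumption of this sketch."""
    ref = intended_lyrics.replace(" ", "")
    hyp = transcript.replace(" ", "")
    if not ref:
        raise ValueError("intended lyrics are empty")
    return edit_distance(ref, hyp) / len(ref)


print(cer("사랑해", "사랑해"))   # identical strings
print(cer("사랑해", "사람해"))   # one substituted syllable out of three
print(cer("가", "가나다"))       # two inserted syllables against a 1-char reference
```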

Hard Gate (service-ready minimum)

A. CER ≤ 0.30  AND  ko_prob ≥ 0.90      [Korean pronunciation]
B. f0_corr ≥ 0.50  AND  range 0.6~1.4   [Pitch]
C. timbre_sim ≥ 0.65                      [Timbre]
D. chunk_disc ≤ 2/min                     [Smoothness]
E. energy_corr ≥ 0.5 · brightness ≥ 0.4 · vibrato 0.5~1.5  [Expression]

Academic basis: CER 0.30 = production STT threshold; SECS 0.60-0.70 = zero-shot SVC passing standard.
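The gate above can be written as one check per dimension. This is a sketch using the thresholds listed in this section; the field names are hypothetical, and in the example only CER 0.52 and ko_prob 0.96 come from the Top 10 table, the other values are placeholders.

```python
from dataclasses import dataclass


@dataclass
class SongMetrics:
    # Field names are hypothetical; thresholds below come from the Hard Gate list.
    cer: float
    ko_prob: float
    f0_corr: float
    f0_range: float
    timbre_sim: float
    chunk_disc_per_min: float
    energy_corr: float
    brightness_corr: float
    vibrato_ratio: float


def hard_gate(m: SongMetrics) -> dict[str, bool]:
    """Return pass/fail for each of the five dimensions."""
    return {
        "A": m.cer <= 0.30 and m.ko_prob >= 0.90,
        "B": m.f0_corr >= 0.50 and 0.6 <= m.f0_range <= 1.4,
        "C": m.timbre_sim >= 0.65,
        "D": m.chunk_disc_per_min <= 2.0,
        "E": (m.energy_corr >= 0.5
              and m.brightness_corr >= 0.4
              and 0.5 <= m.vibrato_ratio <= 1.5),
    }


def is_service_ready(m: SongMetrics) -> bool:
    """A song passes the Hard Gate only if all five dimensions pass."""
    return all(hard_gate(m).values())


# cer and ko_prob are from the Top 10 table (#1); the rest are placeholder values.
example = SongMetrics(cer=0.52, ko_prob=0.96, f0_corr=0.60, f0_range=1.0,
                      timbre_sim=0.90, chunk_disc_per_min=1.0,
                      energy_corr=0.40, brightness_corr=0.30, vibrato_ratio=1.0)
print(hard_gate(example))
print(is_service_ready(example))
```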


2. Framework Self-Validation

Lesson (svs_eval.py §43, 5/17): Whisper-small was invalid for measuring SVS quality. Even golden in-distribution samples produced empty output; singing only triggered speech-likeness.

Self-validation is mandatory:

User best  vs  user worst
     ↓             ↓
   Must be separable by the metric
     ↓
   Otherwise the metric itself is invalid

23-song self-validation result:

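| User rating | Song | Metric ranking | Separated |
|---|---|---|---|
| 🟢 BEST | banan_jaychou_translation | 1/23 | ✓ |
| 🟢 BEST | genre_gunslinger | 3/23 | ✓ |
| 🟢 BEST | banan_adele_translation | 5/23 | ✓ |
| 🔴 WORST | base_macross | 17/23 | ✓ |
| 🔴 WORST | base_flcl | 19/23 | ✓ |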

Framework valid (ranking proxy confirmed). We now have a tool to objectively compare Track A training results.
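The separation test itself is simple: every user-rated BEST song must rank above every WORST song. Here is a minimal sketch with the ranks from this section; the function name is hypothetical.

```python
def is_separated(best_ranks: list[int], worst_ranks: list[int]) -> bool:
    """True if every BEST song ranks strictly above (lower number than) every WORST song."""
    return max(best_ranks) < min(worst_ranks)


best = [1, 3, 5]     # metric ranks of the user-rated BEST songs
worst = [17, 19]     # metric ranks of the user-rated WORST songs

print(is_separated(best, worst))
print(min(worst) - max(best))  # rank gap between the two groups
```

With only five user-rated songs this is a sanity check on the ranking, not a statistical validation.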


3. Hard Gate Pass — 0/23 ❌

| Dim | Criterion | Pass |
|---|---|---|
| A. CER ≤ 0.30 | Academic production | 0/23 |
| A. ko_prob ≥ 0.90 | Whisper Korean detection | 14/23 |
| B. f0_corr ≥ 0.5 + range OK | Pitch preservation | 2/23 |
| C. timbre_sim ≥ 0.65 | Ref consistency | 10/23 |
| D. disc ≤ 2/min | Chunk artifact | 20/23 ✓ |
| E. (3-metric AND) | Cover expression | 1/23 |

Current Vevo1.5 Korean base = 0% service-ready. A·B·E are the critical gaps.


4. Diagnosis — Per-dimension weakness and cause

A. Intelligibility — Vevo Korean phoneme synthesis is weak

Vevo1.5 was pre-trained on Sing-0.4k (438h, including 3.8h of Korean CSD), yet its Korean phoneme mapping is poor: CER avg 1.18 = 70% broken. The ko_prob of 0.86 only shows that the output sounds like Korean acoustically, not that the phonemes are right.

B. Pitch — melody_control under-trained on melody

f0_corr averages 0.18. Several songs are negative (gunslinger -0.20, gsteatrino -0.46), i.e. anti-correlated with the source. vevosing_melody_control prioritizes phoneme content transfer; the F0 contour is secondary.
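To make the sign of f0_corr concrete, here is a minimal sketch of a Pearson correlation over frames that are voiced in both tracks. It assumes unvoiced frames are marked as NaN (as librosa.pyin does) and that both contours are already time-aligned; the function names and F0 values are hypothetical.

```python
import math


def pearson(x: list[float], y: list[float]) -> float:
    """Plain Pearson correlation; returns 0.0 if either series is (near) constant."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx < 1e-12 or vy < 1e-12:
        return 0.0
    return cov / math.sqrt(vx * vy)


def f0_corr(src_f0: list[float], gen_f0: list[float]) -> float:
    """Correlate F0 only on frames voiced in both tracks (NaN = unvoiced)."""
    pairs = [(s, g) for s, g in zip(src_f0, gen_f0)
             if not (math.isnan(s) or math.isnan(g))]
    return pearson([s for s, _ in pairs], [g for _, g in pairs])


nan = float("nan")
source = [220.0, 247.0, 262.0, nan, 294.0]      # rising melody, one unvoiced frame
follows = [110.0, 123.5, 131.0, 150.0, 147.0]   # same contour, one octave lower
inverted = [147.0, 131.0, 123.5, nan, 110.0]    # falls while the source rises

print(f0_corr(source, follows))    # close to +1: contour preserved
print(f0_corr(source, inverted))   # negative: anti-correlated with the source
```

Correlation ignores absolute pitch, which is why the gate pairs it with a separate range-ratio check.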

C. Timbre — ref matching works

Songs using the AI-Hub 8 refs average timbre_sim 0.91. However, timbre drifts at chunk boundaries, which matches the user feedback that "different voices keep entering".

D. Smoothness — chunk merge OK

20/23 pass. The 100ms crossfade is effective. The 3 outliers occur only when the source vocal has heavy silence.
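A chunk merge with a 100ms crossfade can be sketched as below. The post only states the fade length; the linear fade shape and the function name are assumptions of this sketch.

```python
def crossfade(a: list[float], b: list[float], sample_rate: int,
              fade_ms: float = 100.0) -> list[float]:
    """Join two chunks, overlapping the tail of `a` with the head of `b`.

    Assumption: linear fade. Output length is len(a) + len(b) - overlap.
    """
    n = min(int(sample_rate * fade_ms / 1000), len(a), len(b))
    if n == 0:
        return a + b
    head, tail = a[:-n], a[-n:]
    mixed = []
    for i in range(n):
        w = (i + 1) / (n + 1)                  # weight of chunk b, rising from 0 to 1
        mixed.append(tail[i] * (1 - w) + b[i] * w)
    return head + mixed + b[n:]


# Toy signals: a constant 1.0 chunk followed by a constant 0.0 chunk at 1 kHz
out = crossfade([1.0] * 1000, [0.0] * 1000, sample_rate=1000)
print(len(out))              # 1900 samples: 100 samples overlap
print(out[899], out[1000])   # untouched samples on either side of the fade
```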

E. Expression — biggest failure

1/23 pass. Avg energy_corr 0.32, brightness_corr 0.19. The Vevo prosody tokenizer is under-trained on Korean, and dynamics are lost during content-style separation.
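As an illustration of what energy_corr measures, the sketch below correlates frame-level RMS envelopes of the source vocal and the generated vocal. The frame length, the non-overlapping framing, and the function names are assumptions; the actual evaluation script may differ.

```python
import math


def rms_envelope(samples: list[float], frame_len: int) -> list[float]:
    """RMS per non-overlapping frame (assumption: no hop overlap, no windowing)."""
    return [math.sqrt(sum(s * s for s in samples[i:i + frame_len]) / frame_len)
            for i in range(0, len(samples) - frame_len + 1, frame_len)]


def pearson(x: list[float], y: list[float]) -> float:
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx < 1e-12 or vy < 1e-12:   # flat envelope: no dynamics to correlate
        return 0.0
    return cov / math.sqrt(vx * vy)


def energy_corr(source: list[float], generated: list[float], frame_len: int = 4) -> float:
    a, b = rms_envelope(source, frame_len), rms_envelope(generated, frame_len)
    n = min(len(a), len(b))
    return pearson(a[:n], b[:n])


# Toy signals: the source gets louder in three steps
source = [0.1] * 4 + [0.5] * 4 + [1.0] * 4
follows = [0.5 * s for s in source]   # quieter overall, same dynamics
flat = [0.3] * 12                     # dynamics lost

print(energy_corr(source, follows))   # close to 1
print(energy_corr(source, flat))      # 0.0
```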


5. What should we do — prioritized improvements

🟢 P0 (immediate, verified effect)

| Item | Effect | Note |
|---|---|---|
| Apply Track A training results | A·B·E simultaneously | 247h KO fine-tune in progress |
| Single-chunk inference | C·D | 60s song → 15s chorus → matches #1 sweet spot |
| Syllable-accurate lyrics | A | SimpleAligner equal-split → match source syllable count |

Recipe of #1 (banan_jaychou_translation): short (13s) + syllable-matched + clean studio source
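The syllable-matching idea can be sketched in a few lines: count the Hangul syllables in the translated lyrics, compare against the source syllable count, and only then spread them evenly over the phrase. This is a hypothetical stand-in, not the SimpleAligner code; the function names and the example phrase are assumptions.

```python
def hangul_syllables(text: str) -> list[str]:
    """Precomposed Hangul syllables only (Unicode block U+AC00 to U+D7A3)."""
    return [ch for ch in text if "\uac00" <= ch <= "\ud7a3"]


def syllables_match(source_syllable_count: int, korean_lyrics: str) -> bool:
    """Check that the translation has as many syllables as the source phrase."""
    return len(hangul_syllables(korean_lyrics)) == source_syllable_count


def equal_split(korean_lyrics: str, start: float, end: float) -> list[tuple[str, float, float]]:
    """Assign each syllable an equal share of the phrase duration (in seconds)."""
    syllables = hangul_syllables(korean_lyrics)
    if not syllables:
        raise ValueError("no Hangul syllables in lyrics")
    step = (end - start) / len(syllables)
    return [(s, start + i * step, start + (i + 1) * step)
            for i, s in enumerate(syllables)]


print(len(hangul_syllables("사랑해 너를")))   # 5
print(syllables_match(5, "사랑해 너를"))      # True
print(equal_split("사랑해", 0.0, 3.0))
```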

🟡 P1 (within a week)

| Item | Effect |
|---|---|
| Introduce ECAPA-TDNN SECS | C accuracy (replace MFCC proxy) |
| UTMOS / Sing-MOS predictor | D naturalness objectification |
| Phrase-aware aligner | A·D (no more equal-syllable split) |
| Suno API → Korean source | A·E (Korean vocal source pool) |

🟠 P2 (mid-term, verification needed)

| Item | Effect | Risk |
|---|---|---|
| Vevo1.5 fine-tune (Track A) | A·B·E all | Result unverified |
| Re-examine TCSinger2 pivot | Expression boost possible | Stack complex, ceiling risk |
| F5-TTS / CosyVoice2 SVS adapter | A intelligibility | Adapter missing |
| SVC post-processing (RVC/SoulX) | C timbre | End-to-end CER must be measured |

🔴 P3 (long-term, paradigm shift)

  • Own SVS model training → verified external base is more advantageous (Naia core philosophy)
  • Subjective MOS listening test → absolute ground truth
  • Suno + own cover pipeline productization → "AI singing service" Track D

6. Next Decision Points

  1. When Track A finishes → re-evaluate with this framework, measure improvement delta
  2. Improvement < 20% → drop Vevo1.5 base, reconsider TCSinger2 or F5-TTS
  3. Improvement ≥ 30% → full fine-tune + Track D service entry
  4. User gate = listen to Track A results + check all 5 dimensions, then decide
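The thresholds above can be sketched as a small decision helper. Two things are assumptions here: that "improvement" means the relative change in composite score, and the label strings. The list does not say what happens between 20% and 30%, so the sketch returns that band as undecided rather than guessing.

```python
def next_step(baseline_score: float, track_a_score: float) -> str:
    """Map the improvement delta to the decision points listed above.

    Assumption: improvement = relative change in composite score.
    """
    if baseline_score <= 0:
        raise ValueError("baseline score must be positive")
    improvement = (track_a_score - baseline_score) / baseline_score
    if improvement < 0.20:
        return "drop_vevo_base"        # reconsider TCSinger2 or F5-TTS
    if improvement >= 0.30:
        return "full_finetune"         # plus Track D service entry
    return "undecided_20_to_30"        # not specified in the list; left to the user gate


# Hypothetical scores, for illustration only
print(next_step(50.0, 55.0))   # +10%
print(next_step(50.0, 62.5))   # +25%
print(next_step(50.0, 70.0))   # +40%
```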

7. naia-sing 32-day research log

Based on git history — no hallucinations. Real start = 2026-04-25 (one month of history).

2026-04-25 ─ Start: project setup + RVC pipeline scripts.
2026-04-26 ─ Source data + cross-lingual cover pipeline + IP groundwork.
2026-04-28 ─ 3-frontier lyrics + Korean SVS verification + phone-call ref voice pipeline.
2026-05-15 ─ DSKR (DiffSinger Korean) + AI-Hub 465 Korean SVS training stack confirmed.
2026-05-15 ─ IO bottleneck root-fix: pickle 13x lighter + HDD→SSD (avoid 58s/step).
2026-05-16 ─ Working stack confirmed — DSKR+CSD transfer → RVC swap. User verified OK.
2026-05-16 ─ Vocoder ceiling refuted + extend training 300k→500k.
2026-05-16 ─ Extended training overfit refuted — best=S_300000 locked, step axis closed.
2026-05-18 ─ BigVGAN wired as DSKR default vocoder + aihubshell key security patch.
2026-05-19 ─ DSKR S_300000 ceiling realness confirmed (MCD37 vs 48 cross-song).
2026-05-19 ─ TCSinger 2 Korean SVS started → PAUSED @ VAE epoch 395.
2026-05-21 ─ R3 adversarial review consolidation + naia-sing resource gate hold.
2026-05-26 ─ TCSinger2 closed → Vevo1.5 Korean SVS discovered.
2026-05-26 ─ Korean singing dataset 247h complete (GTSinger + AI-Hub 465).
2026-05-26 ─ BASE STATE SNAPSHOT — baseline lock before fine-tune.
2026-05-26 ─ 3-AI cross-review (codex+gemini+opencode) + Reality Check.
2026-05-26 ─ Track A Phase 1 AR smoke PASS + Phase 2 autonomous wrapper.
2026-05-26 ─ Per-stage VRAM measurement + 5 cover batch + 10 anime OST 60s.
2026-05-26 ─ Critical finding: Vevo melody_control = source vocal phoneme dependent.
2026-05-26 ─ AI-Hub 8 short refs genre-matched batch + ffmpeg loudnorm post-processing.
2026-05-27 ─ This report: 5-dim evaluation framework + 23-song baseline + Hard Gate.
2026-05-27 ─ Framework valid confirmed + Top 10 compilation video.

About 32 days:

  • 5 different SVS stacks tried (RVC → DSKR → SoulX → TCSinger2 → Vevo1.5)
  • 1 ceiling reached (DSKR S_300000) + 1 paused (TCSinger2 VAE) + 1 dropped (RVC pure)
  • Current = Vevo1.5 external base + Track A 247h KO fine-tune in progress

No own model training. Differentiated by verified external base + our stack (memory / privacy / RAG / local serving). Naia core philosophy, paradigm locked 2026-05-15.


Honest Limits

This is not a report that the base is good. Rather it's a report that objectively measures the base as 0% service-ready. We now have a metric framework whose ranking aligns with human listening — a tool to honestly check whether training actually works. That is the meaning of this post.

When Track A finishes, we will re-evaluate the same 23 songs and report the improvement delta quantitatively.


  • Source markdown (SoT): .agents/work/benchmark-2026-05-27/BENCHMARK_REPORT.md
  • Evaluation framework SoT: same path, README.md
  • Top 10 compilation video: top10_mp4/naia_sing_top10.mp4 (7m24s, 18MB)


This post is licensed under CC BY-NC-SA 4.0.
