Naia Sings — Korean SVS Benchmark First Baseline (Vevo1.5 base, 23 songs evaluated)

Hello, I'm Luke, building Naia.

Naia's voice model has reached some results, and now I'm looking at singing too. I want Alpha to sing for me. In particular, I'm interested in cover songs (translating foreign songs into Korean).

I started a month ago, hit a ceiling, and recently got slightly better results — feels like more digging will yield something. Previously Korean lyrics and pitch were both alien-level garbage; now it actually sings something like Korean.

Still sounds a bit drunk though. ^^ Still, since something came out and it's fun, I'm sharing the progress along with the report. Someday I hope it really sings for me.

Naia-Sing is not yet stabilized — this is closer to a progress share than a launch. The official release order is Naia-talk (Naia-Omni) first (already stabilized), and Naia-Sing follows.

TL;DR

23 songs evaluated, composite score 27.8~60.6 (0~100 scale)
User-rated BEST 3 songs = metric rank 1·3·5, WORST 2 = rank 17·19 → framework validity confirmed
Hard Gate (service-ready minimum) = 0/23 passed — current base is not service-ready
Biggest weakness: A. Intelligibility (avg CER 1.18) + E. Expression (only 1/23)
Next step = apply Track A 247h Korean fine-tune + shorter inference + syllable-accurate lyrics

#	Song	Source	Score	Strength
1	banan_jaychou_translation	Jay Chou ballad (13s, ko translation)	60.6	CER 0.52 lowest overall · ko 0.96
2	genre_digicharat	Digi Charat Party Night (1999)	50.8	ko 0.97 · timbre 0.92
3	genre_gunslinger	Gunslinger Girl doll · Lia	48.8	timbre 0.93 · ballad match
4	genre_escaflowne	Escaflowne OP (1996)	48.2	f0 range 0.91 match
5	banan_adele_translation	Adele — Someone Like You (13s)	47.8	E.energy 0.72 highest
6	base_gunslinger	Same song, KO-S1 baseline	47.8	CER 0.70 (2nd lowest)
7	genre_macross	Macross — Do You Remember Love? (1984)	47.7	ko 0.97 · timbre 0.92
8	genre_chobits	Let Me Be With You (2002)	46.7	timbre 0.93
9	pipe_01_adele	Adele source 13s pipeline	45.9	f0_corr 0.81 highest
10	genre_nadia	Nadia: Secret of Blue Water (1990)	44.2	timbre 0.97 highest

1. "Singing Korean well" — 5 independent dimensions

Singing quality cannot be reduced to one metric. Academic work (TCSinger, Vevo, DiffSinger) also reports 4-5 dimensions together.

Dim	Academic metric	Our measurement	Meaning
A. Intelligibility	CER, PER	Whisper-small KO STT vs intended lyrics edit distance	"Are the lyrics audible"
B. Pitch	F0 RMSE, F0 corr	librosa.pyin F0 pearson + range ratio	"Hitting the notes"
C. Timbre similarity	SECS (ECAPA cosine)	MFCC mean cosine (proxy)	"Same singer"
D. Naturalness	MOS-N, UTMOS	chunk_disc per min (proxy)	"Not machine-like"
E. Expression (cover-only)	(own definition)	source RMS + spectral centroid + vibrato corr	"Follows original's expression"

Dimension E is core to cover songs. In general SVS, the musical score is the ground truth for expression. In a cover, the source vocal's dynamics, vibrato, and articulation are the ground truth.

Hard Gate (service-ready minimum)

A. CER ≤ 0.30  AND  ko_prob ≥ 0.90      [Korean pronunciation]
B. f0_corr ≥ 0.50  AND  range 0.6~1.4   [Pitch]
C. timbre_sim ≥ 0.65                      [Timbre]
D. chunk_disc ≤ 2/min                     [Smoothness]
E. energy_corr ≥ 0.5 · brightness ≥ 0.4 · vibrato 0.5~1.5  [Expression]

Academic basis: CER 0.30 = production STT threshold; SECS 0.60-0.70 = zero-shot SVC passing standard.
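
The gate conditions above amount to one AND over five per-dimension checks. A minimal sketch follows. The `SongMetrics` field names, the reading of "brightness" as `brightness_corr`, and the reading of "vibrato" as an output/source ratio are assumptions for illustration, not the actual evaluation script's interface.

```python
from dataclasses import dataclass


@dataclass
class SongMetrics:
    # Field names are hypothetical; they mirror the labels used in this report.
    cer: float
    ko_prob: float
    f0_corr: float
    f0_range_ratio: float
    timbre_sim: float
    chunk_disc_per_min: float
    energy_corr: float
    brightness_corr: float
    vibrato_ratio: float


def hard_gate(m: SongMetrics) -> dict:
    """Return pass/fail per dimension, using the thresholds listed above."""
    return {
        "A": m.cer <= 0.30 and m.ko_prob >= 0.90,
        "B": m.f0_corr >= 0.50 and 0.6 <= m.f0_range_ratio <= 1.4,
        "C": m.timbre_sim >= 0.65,
        "D": m.chunk_disc_per_min <= 2.0,
        "E": (m.energy_corr >= 0.5
              and m.brightness_corr >= 0.4
              and 0.5 <= m.vibrato_ratio <= 1.5),
    }


def passes_hard_gate(m: SongMetrics) -> bool:
    """A song is service-ready only if every dimension passes."""
    return all(hard_gate(m).values())


# CER 0.52 and ko_prob 0.96 are the reported values for song #1.
# Every other number below is a made-up placeholder.
example = SongMetrics(
    cer=0.52, ko_prob=0.96, f0_corr=0.6, f0_range_ratio=1.0,
    timbre_sim=0.9, chunk_disc_per_min=1.0,
    energy_corr=0.6, brightness_corr=0.5, vibrato_ratio=1.0,
)
result = hard_gate(example)
```

Even with generous placeholder values elsewhere, the reported CER of 0.52 alone fails dimension A, so the song as a whole fails the gate.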

2. Framework Self-Validation

Lesson — Whisper-small was invalid for measuring SVS quality (svs_eval.py §43, 5/17): even golden in-distribution samples produced empty output. Singing only triggered speech-likeness.

→ Self-validation is mandatory:

User best  vs  user worst
     ↓             ↓
   Must be separable by the metric
     ↓
   Otherwise the metric itself is invalid

23-song self-validation result:

User rating	Song	Metric ranking	Separated
🟢 BEST	banan_jaychou_translation	1/23	✓
🟢 BEST	genre_gunslinger	3/23	✓
🟢 BEST	banan_adele_translation	5/23	✓
🔴 WORST	base_macross	17/23	✓
🔴 WORST	base_flcl	19/23	✓

→ Framework valid (ranking proxy confirmed). We now have a tool to objectively compare Track A training results.
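
The separability check can be sketched in a few lines. The strict criterion used here (every user-best song must rank above every user-worst song) is an assumption; the report only states that best and worst must be separable by the metric.

```python
def is_separable(best_ranks, worst_ranks):
    """True if every user-best song ranks above every user-worst song.

    Rank 1 is the highest metric score, so "above" means a smaller number.
    """
    if not best_ranks or not worst_ranks:
        raise ValueError("need at least one best and one worst song")
    return max(best_ranks) < min(worst_ranks)


def separation_margin(best_ranks, worst_ranks):
    """Number of rank positions between the lowest best and highest worst."""
    return min(worst_ranks) - max(best_ranks)


# Ranks taken from the self-validation table above (out of 23 songs).
best = [1, 3, 5]
worst = [17, 19]
separable = is_separable(best, worst)
margin = separation_margin(best, worst)
```

With the ranks from the table, the lowest-ranked BEST song (5) and the highest-ranked WORST song (17) are 12 positions apart.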

3. Hard Gate Pass — 0/23 ❌

Dim	Criterion	Pass
A. CER ≤ 0.30	Academic production	0/23 ❌
A. ko_prob ≥ 0.90	Whisper Korean detection	14/23
B. f0_corr ≥ 0.5 + range OK	Pitch preservation	2/23 ❌
C. timbre_sim ≥ 0.65	Ref consistency	10/23
D. disc ≤ 2/min	Chunk artifact	20/23 ✓
E. (3-metric AND)	Cover expression	1/23 ❌

Current Vevo1.5 Korean base = 0% service-ready. A·B·E are the critical gaps.

4. Diagnosis — Per-dimension weakness and cause

A. Intelligibility — Vevo Korean phoneme synthesis is weak

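Vevo1.5 was pre-trained on Sing-0.4k (438h, including 3.8h of Korean CSD), yet its Korean phoneme mapping is poor: CER avg 1.18 = 70% broken. A CER above 1.0 means the transcript needs more edits than the intended lyrics have characters. The ko_prob of 0.86 only shows that the output "sounds like Korean" acoustically, not that the phonemes are right.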
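
For readers wondering how a CER can exceed 1.0, here is a minimal sketch of character error rate as edit distance divided by reference length. Ignoring whitespace and comparing whole Hangul syllable blocks (rather than jamo) are assumptions; the report does not say how the evaluation script normalizes text.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance (insert, delete, substitute all cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate of `hyp` (STT output) against `ref` (lyrics)."""
    ref = "".join(ref.split())  # assumption: whitespace is ignored
    hyp = "".join(hyp.split())
    if not ref:
        raise ValueError("reference lyrics must not be empty")
    return edit_distance(ref, hyp) / len(ref)


perfect = cer("사랑해", "사랑해")          # identical strings
one_missing = cer("사랑해", "사랑")        # one syllable dropped
runaway = cer("가나", "다라마바사")        # STT emits extra, wrong syllables
```

Because the denominator is the length of the intended lyrics, a transcript that is both wrong and longer than the reference pushes CER past 1.0, which is how an average of 1.18 is possible.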

B. Pitch — melody_control under-trained on melody

Average f0_corr is 0.18. Several songs are negative (gunslinger -0.20, gsteatrino -0.46), meaning the output pitch moves against the source. vevosing_melody_control prioritizes phoneme content transfer; the F0 contour is secondary.
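
A minimal sketch of the two pitch numbers, written in plain Python so it runs without audio libraries. It assumes both F0 tracks are already frame-aligned, that unvoiced frames are `None` or NaN (librosa.pyin marks unvoiced frames as NaN by default), and that the range ratio is output range divided by source range in Hz. None of these details are confirmed by the report.

```python
import math


def voiced_pairs(src_f0, out_f0):
    """Keep only frames where both source and output are voiced."""
    def voiced(x):
        return x is not None and not math.isnan(x)
    return [(s, o) for s, o in zip(src_f0, out_f0) if voiced(s) and voiced(o)]


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either series is constant."""
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two voiced frames")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # assumption: treat a flat contour as uncorrelated
    return cov / math.sqrt(vx * vy)


def f0_metrics(src_f0, out_f0):
    """Return (f0_corr, range_ratio) over frames voiced in both tracks."""
    pairs = voiced_pairs(src_f0, out_f0)
    src = [s for s, _ in pairs]
    out = [o for _, o in pairs]
    src_range = max(src) - min(src)
    if src_range == 0:
        raise ValueError("source F0 has no range")
    return pearson(src, out), (max(out) - min(out)) / src_range


source = [200.0, None, 300.0, 400.0]
inverted = [400.0, 350.0, 300.0, 200.0]       # melody moves the wrong way
compressed = [250.0, float("nan"), 300.0, 350.0]  # right shape, half the range

inv_corr, inv_ratio = f0_metrics(source, inverted)
cmp_corr, cmp_ratio = f0_metrics(source, compressed)
```

The two toy contours show why both numbers are needed: the inverted contour has a perfect range ratio but a negative correlation, while the compressed contour correlates perfectly but has a range ratio of 0.5, below the 0.6 gate.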

C. Timbre — ref matching works

Songs using the 8 AI-Hub refs average timbre_sim 0.91. However, timbre drifts at chunk boundaries, which matches the user feedback that "different voices keep entering".
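
One way to make the chunk-boundary drift visible is to compute the MFCC-mean cosine per chunk instead of once per song. The sketch below assumes MFCC frames are already extracted as lists of floats; the function names and the per-chunk idea are illustrative, not part of the current framework.

```python
import math


def mean_vector(frames):
    """Average a list of equal-length feature frames into one vector."""
    if not frames:
        raise ValueError("no frames")
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0 or nv == 0:
        return 0.0
    return dot / (nu * nv)


def timbre_sim(ref_frames, out_frames):
    """MFCC-mean cosine between reference voice and generated audio."""
    return cosine(mean_vector(ref_frames), mean_vector(out_frames))


def per_chunk_timbre(ref_frames, chunks):
    """Similarity of each generated chunk to the reference voice."""
    return [timbre_sim(ref_frames, chunk) for chunk in chunks]


# Toy 2-dimensional "MFCC" frames, made up for illustration.
ref = [[1.0, 0.0], [1.0, 0.0]]
chunks = [[[2.0, 0.0]], [[0.0, 3.0]]]
sims = per_chunk_timbre(ref, chunks)
```

A song-level average can hide one off-voice chunk; the per-chunk list exposes it as a single low value.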

D. Smoothness — chunk merge OK

20/23 pass. 100ms crossfade effective. 3 outliers only when source vocal has heavy silence.
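
For reference, a 100ms crossfade between two adjacent chunks can be sketched as below. It assumes the chunks overlap in time by the fade length and uses a linear ramp; the report does not say whether the actual merge is linear or equal-power.

```python
def crossfade(a, b, sample_rate, fade_ms=100.0):
    """Join two chunks, blending the tail of `a` into the head of `b`.

    `a` and `b` are sequences of float samples. The last `fade_ms` of `a`
    is assumed to cover the same audio as the first `fade_ms` of `b`.
    """
    n = min(round(sample_rate * fade_ms / 1000.0), len(a), len(b))
    if n <= 0:
        return list(a) + list(b)
    split = len(a) - n
    mixed = []
    for i in range(n):
        w = (i + 1) / (n + 1)  # weight of chunk b, rising towards 1
        mixed.append(a[split + i] * (1.0 - w) + b[i] * w)
    return list(a[:split]) + mixed + list(b[n:])


# Toy signals at a 1 kHz "sample rate", so 100 ms is 100 samples.
ones = [1.0] * 1000
zeros = [0.0] * 1000
steady = crossfade(ones, ones, sample_rate=1000)
fading = crossfade(ones, zeros, sample_rate=1000)
```

The merged result is shorter than the two chunks combined by exactly the fade length, and a constant signal stays constant through the overlap because the two weights always sum to 1.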

E. Expression — biggest failure

1/23 pass. Avg energy_corr 0.32, brightness_corr 0.19. Vevo prosody tokenizer is Korean-undertrained + dynamics lost during content-style separation.

5. What should we do — prioritized improvements

🟢 P0 (immediate, verified effect)

Item	Effect	Note
Apply Track A training results	A·B·E simultaneously	247h KO fine-tune in progress
Single-chunk inference	C·D	60s song → 15s chorus → matches #1 sweet spot
Syllable-accurate lyrics	A	SimpleAligner equal-split → match source syllable count

Recipe of #1 (banan_jaychou_translation): short (13s) + syllable-matched + clean studio source
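
A syllable-count check is cheap to run before inference. The sketch below counts precomposed Hangul syllables (Unicode range U+AC00 to U+D7A3) in a translated line and splits a time span into equal slots. The function names are hypothetical, and `equal_split` only illustrates the equal-split idea; it is not SimpleAligner's actual code.

```python
import unicodedata


def count_hangul_syllables(text: str) -> int:
    """Count Hangul syllable blocks after NFC normalization."""
    normalized = unicodedata.normalize("NFC", text)
    return sum(1 for ch in normalized if "\uac00" <= ch <= "\ud7a3")


def syllable_gap(korean_line: str, source_syllables: int) -> int:
    """Positive: the translation has too many syllables. Negative: too few."""
    return count_hangul_syllables(korean_line) - source_syllables


def equal_split(start: float, end: float, n: int):
    """Divide [start, end] into n equal (start, end) slots, one per syllable."""
    if n <= 0:
        raise ValueError("n must be positive")
    step = (end - start) / n
    return [(start + i * step, start + (i + 1) * step) for i in range(n)]


line = "너를 사랑해"
count = count_hangul_syllables(line)
gap = syllable_gap(line, source_syllables=7)  # source count is a made-up example
slots = equal_split(0.0, 2.0, 4)
```

A non-zero gap means the translated line cannot map one-to-one onto the source syllables, so the lyric should be rewritten before alignment rather than stretched.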

🟡 P1 (within a week)

Item	Effect
Introduce ECAPA-TDNN SECS	C accuracy (replace MFCC proxy)
UTMOS / Sing-MOS predictor	D naturalness objectification
Phrase-aware aligner	A·D (no more equal-syllable split)
Suno API → Korean source	A·E (Korean vocal source pool)

🟠 P2 (mid-term, verification needed)

Item	Effect	Risk
Vevo1.5 fine-tune (Track A)	A·B·E all	Result unverified
Re-examine TCSinger2 pivot	Expression boost possible	Stack complex, ceiling risk
F5-TTS / CosyVoice2 SVS adapter	A intelligibility	Adapter missing
SVC post-processing (RVC/SoulX)	C timbre	End-to-end CER must be measured

🔴 P3 (long-term, paradigm shift)

Own SVS model training → verified external base is more advantageous (Naia core philosophy)
Subjective MOS listening test → absolute ground truth
Suno + own cover pipeline productization → "AI singing service" Track D

6. Next Decision Points

When Track A finishes → re-evaluate with this framework, measure improvement delta
Improvement < 20% → drop Vevo1.5 base, reconsider TCSinger2 or F5-TTS
Improvement ≥ 30% → full fine-tune + Track D service entry
User gate = listen to Track A results + check all 5 dimensions, then decide
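
The thresholds above can be written down as a small rule. Two things are assumptions here: that "improvement" means relative change in the composite score, and that the band between 20% and 30%, which the decision points do not cover, is left to the user gate.

```python
def relative_improvement(before: float, after: float) -> float:
    """Relative change of a score, e.g. 0.25 for a 25% gain."""
    if before <= 0:
        raise ValueError("baseline score must be positive")
    return (after - before) / before


def decide(improvement: float) -> str:
    """Map an improvement ratio to the next step listed above."""
    if improvement < 0.20:
        return "drop Vevo1.5 base, reconsider TCSinger2 or F5-TTS"
    if improvement >= 0.30:
        return "full fine-tune + Track D service entry"
    # Assumption: the 20-30% band is not defined in the plan.
    return "undecided: defer to user gate"


# Scores below are made-up examples, not measured results.
small = decide(relative_improvement(40.0, 44.0))
middle = decide(relative_improvement(40.0, 50.0))
large = decide(relative_improvement(40.0, 54.0))
```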

7. naia-sing 32-day research log

Based on git history — no hallucinations. Real start = 2026-04-25 (one month of history).

2026-04-25 ─ Start: project setup + RVC pipeline scripts.
2026-04-26 ─ Source data + cross-lingual cover pipeline + IP groundwork.
2026-04-28 ─ 3-frontier lyrics + Korean SVS verification + phone-call ref voice pipeline.
2026-05-15 ─ DSKR (DiffSinger Korean) + AI-Hub 465 Korean SVS training stack confirmed.
2026-05-15 ─ IO bottleneck root-fix — pickle 13x light + HDD→SSD (avoid 58s/step).
2026-05-16 ─ Working stack confirmed — DSKR+CSD transfer → RVC swap. User verified OK.
2026-05-16 ─ Vocoder ceiling refuted + extend training 300k→500k.
2026-05-16 ─ Extended training overfit refuted — best=S_300000 locked, step axis closed.
2026-05-18 ─ BigVGAN wired as DSKR default vocoder + aihubshell key security patch.
2026-05-19 ─ DSKR S_300000 ceiling realness confirmed (MCD37 vs 48 cross-song).
2026-05-19 ─ TCSinger 2 Korean SVS started → PAUSED @ VAE epoch 395.
2026-05-21 ─ R3 adversarial review consolidation + naia-sing resource gate hold.
2026-05-26 ─ TCSinger2 closed → Vevo1.5 Korean SVS discovered.
2026-05-26 ─ Korean singing dataset 247h complete (GTSinger + AI-Hub 465).
2026-05-26 ─ BASE STATE SNAPSHOT — baseline lock before fine-tune.
2026-05-26 ─ 3-AI cross-review (codex+gemini+opencode) + Reality Check.
2026-05-26 ─ Track A Phase 1 AR smoke PASS + Phase 2 autonomous wrapper.
2026-05-26 ─ Per-stage VRAM measurement + 5 cover batch + 10 anime OST 60s.
2026-05-26 ─ Critical finding: Vevo melody_control = source vocal phoneme dependent.
2026-05-26 ─ AI-Hub 8 short refs genre-matched batch + ffmpeg loudnorm post-processing.
2026-05-27 ─ This report: 5-dim evaluation framework + 23-song baseline + Hard Gate.
2026-05-27 ─ Framework valid confirmed + Top 10 compilation video.

About 32 days:

5 different SVS stacks tried (RVC → DSKR → SoulX → TCSinger2 → Vevo1.5)
1 ceiling reached (DSKR S_300000) + 1 paused (TCSinger2 VAE) + 1 dropped (RVC pure)
Current = Vevo1.5 external base + Track A 247h KO fine-tune in progress

→ No own model training. Differentiated by verified external base + our stack (memory / privacy / RAG / local serving). Naia core philosophy, paradigm locked 2026-05-15.

Honest Limits

This is not a report that the base is good. Rather it's a report that objectively measures the base as 0% service-ready. We now have a metric framework whose ranking aligns with human listening — a tool to honestly check whether training actually works. That is the meaning of this post.

When Track A finishes, we will re-evaluate the same 23 songs and report the improvement delta quantitatively.

Source markdown (SoT) — .agents/work/benchmark-2026-05-27/BENCHMARK_REPORT.md
Evaluation framework SoT — same path, README.md
Top 10 compilation video — top10_mp4/naia_sing_top10.mp4 (7m24s, 18MB)