A ~10KB page that sings text — can you decode it back? (pitch-only baseline already hits 43%)

Hugging Face Forums [Unofficial] June 20, 2026

This looked interesting, so I checked whether there are reusable pieces around it:

Short answer

Yes — I think this is very plausibly decodable and developable, but I would not start by framing it as only “ASR” or only “a language question”.

A useful nearby frame is:

a tiny deterministic singing-code / synthetic Singing Voice Transcription sandbox

In other words, this can be separated into several layers:

audio → notes / motifs
audio or motifs → text
word ↔ motif alignment

That separation seems useful because nearby fields already have reusable tools, metrics, datasets, and failure-mode maps.

The interesting part is that this project has something real singing datasets usually do not have: a generator that can produce perfect labels for audio, text, notes/motifs, timing, vowels, and word↔motif alignment.

So before jumping to a large neural ASR model, I would probably grow it in this order:

toy demo
→ reproducible benchmark
→ structured metadata
→ note/motif scoring
→ text scoring
→ word↔motif alignment scoring
→ stronger non-ML baselines
→ toy / injective / noisy / style variants
→ CTC / Wav2Vec2 / SVT-style joint models
→ human-learnability / code-language boundary questions

Why Singing Voice Transcription seems like a useful nearby frame

A particularly useful neighboring area is Singing Voice Transcription (SVT), especially newer work that tries to unify lyric transcription, note transcription, and lyric-note alignment.

In ordinary speech ASR, the target is mostly:

audio → text

But in singing transcription, the target is often closer to:

audio → lyrics
audio → notes / melody
lyrics ↔ notes alignment

That maps surprisingly well onto this project:

SVT / singing transcription layer	This project’s analogue
lyrics transcription	audio → original text
note transcription	audio → note/motif sequence
lyric-note alignment	word ↔ motif alignment
phoneme/vowel alignment	vowel/formant cues inside motifs
style / singer variation	later style/noisy/timbre variants
out-of-distribution singing	later robustness tests

So I would treat this less as “general ASR” at first, and more as a controlled synthetic version of an SVT problem.

Some nearby references:

Reference	Why it seems relevant
VocalParse / GitHub / paper	A unified SVT model that outputs a structured autoregressive token sequence encoding lyrics, pitch, note values, and BPM. Useful as a future “joint structured output” direction.
SongTrans	Transcribes lyrics and notes while aligning them. The “word ↔ note” problem is very close to “word ↔ motif” here.
DALI / paper / GitHub	Dataset with synchronized audio, lyrics, and vocal melody notes. Lyrics are represented at note, word, line, and paragraph granularity. Very useful for metadata design.
SVT_SpeechBrain	Separates automatic lyric transcription and automatic music transcription for singing voice. Useful for the ALT/AMT split.
VOCANO	Singing voice → MIDI/note transcription. Useful for the audio→note/motif layer.
JamendoLyrics / Jam-ALT	Time-aligned lyrics and a benchmark perspective on readable lyric transcription, formatting, non-word sounds, and normalization.
GTSinger	Modern singing corpus with phoneme-level annotations, alignments, style labels, and multilingual singing data. Useful for future style/phoneme/vowel variants.
STARS	Unified singing transcription, alignment, and refined style annotation. Useful for thinking beyond text-only decoding.

A practical benchmark layout

I would split the benchmark into at least three measurable tasks.

Layer	Task	Possible metric family
Note/motif recovery	recover pitch events, notes, or motif sequence from audio	MIR-style note transcription metrics, motif accuracy
Text recovery	recover original word sequence	word accuracy, WER, CER
Alignment recovery	recover which word corresponds to which motif and when	alignment accuracy, timing error
Robustness	recover the same labels under perturbations	score degradation under noise/compression/reverb/etc.
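
To make the alignment row a bit more concrete, here is a minimal sketch of what "alignment accuracy" and "timing error" could mean. It is a simplified stand-in in plain Python, not the mir_eval implementation: the `Event` type, the `alignment_scores` name, and the position-by-position pairing are all my assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One word/motif event with timing in seconds (hypothetical schema)."""
    word: str
    motif_id: str
    start: float
    end: float


def alignment_scores(reference, predicted):
    """Score predicted events against reference events, paired by position.

    Assumption: both lists are in time order and event i is compared with
    event i. A real scorer would probably match events by time overlap.
    """
    n = min(len(reference), len(predicted))
    if n == 0:
        return {"pair_accuracy": 0.0, "mean_onset_error": None}
    pairs = list(zip(reference[:n], predicted[:n]))
    correct = sum(
        1 for ref, hyp in pairs
        if (ref.word, ref.motif_id) == (hyp.word, hyp.motif_id)
    )
    onset_error = sum(abs(ref.start - hyp.start) for ref, hyp in pairs) / n
    return {
        # Divide by the reference length so missing predictions count as errors.
        "pair_accuracy": correct / len(reference),
        "mean_onset_error": onset_error,
    }


reference = [Event("the", "m_001", 0.00, 0.40), Event("red", "m_014", 0.42, 0.81)]
predicted = [Event("the", "m_001", 0.02, 0.40), Event("red", "m_015", 0.40, 0.80)]
scores = alignment_scores(reference, predicted)
```

Keeping pair correctness and onset error as separate numbers makes it visible when a decoder gets the right words at the wrong times, or the reverse.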

Useful existing tooling:

Tool	Use
mir_eval.transcription	Note onset / offset / pitch style evaluation. Useful for the note/motif layer.
mir_eval.alignment	Event timestamp alignment evaluation. Useful for word↔motif or motif timing checks.
jiwer	WER, CER, MER, WIL, WIP. Useful for the text layer.
librosa.pyin	F0 estimation baseline beyond simple FFT.
aubio	Onset and pitch detection tools. Useful for lightweight note-event baselines.
CREPE	Neural pitch tracking baseline.
AudioFolder	Simple Hugging Face packaging route for audio + metadata.
Lhotse	More structured speech/audio manifest tooling if the benchmark grows.
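
For the text layer, the core of WER and CER is just an edit distance over tokens. The sketch below is a dependency-free stand-in to show what is being measured. It is not jiwer's code, and it skips the text normalization that a real scoring script would need.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, ref_tok in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_tok in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                        # deletion
                cur[j - 1] + 1,                     # insertion
                prev[j - 1] + (ref_tok != hyp_tok), # substitution or match
            ))
        prev = cur
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edits over reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```

The same `edit_distance` works on lists of motif IDs, which would give a "motif error rate" for the note/motif layer using the same machinery.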

Suggested metadata

Instead of storing only `audio` and `text`, it may be useful to store the generated structure as well.

Something like:

{
  "audio": "sample_0001.wav",
  "text": "the red bird",
  "words": [
    {
      "word": "red",
      "motif_id": "m_014",
      "notes": ["A4", "C5"],
      "vowels": ["e"],
      "start": 0.42,
      "end": 0.81
    }
  ],
  "codec_version": "loom-v0",
  "split": "test",
  "difficulty": "toy",
  "collision_group": null
}

This would make the dataset usable from several angles:

User wants to test…	They can use…
text decoding	`audio`, `text`
pitch/motif recovery	`audio`, `notes`, `motif_id`
vowel-aware decoding	`audio`, `vowels`, `word`
alignment	`start`, `end`, `word`, `motif_id`
codec audit	`motif_id`, `collision_group`, `codec_version`
robustness	same labels under noisy variants
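
As a rough illustration of the table above, one record can be projected into per-task views. The field groupings are taken from the table. The `task_view` function and the task names are hypothetical, and the sample record is the one from the metadata example.

```python
# Word-level fields needed by each task, following the table above.
WORD_LEVEL_FIELDS = {
    "pitch_motif": ("notes", "motif_id"),
    "vowel": ("vowels", "word"),
    "alignment": ("start", "end", "word", "motif_id"),
}


def task_view(record, task):
    """Return the audio path plus only the labels a given task needs."""
    if task == "text":
        return {"audio": record["audio"], "text": record["text"]}
    fields = WORD_LEVEL_FIELDS[task]
    targets = [{key: word[key] for key in fields} for word in record["words"]]
    return {"audio": record["audio"], "targets": targets}


record = {
    "audio": "sample_0001.wav",
    "text": "the red bird",
    "words": [
        {"word": "red", "motif_id": "m_014", "notes": ["A4", "C5"],
         "vowels": ["e"], "start": 0.42, "end": 0.81},
    ],
    "codec_version": "loom-v0",
    "split": "test",
    "difficulty": "toy",
    "collision_group": None,
}

alignment_view = task_view(record, "alignment")
```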

The DALI dataset is a useful mental model here: it stores lyrics and vocal notes with time alignment and multiple lyric granularities. For this project, the analogous hierarchy could be:

note → motif → word → line/sample
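
Because the generator already knows this whole hierarchy, it could emit the structured record directly instead of having anyone annotate it afterwards. Here is a minimal sketch under loud assumptions: the codebook, vowel labels, note duration, and word gap below are invented placeholders, not the project's actual mapping or timing.

```python
# Hypothetical codebook: word -> (motif_id, notes, vowels). Placeholder values.
CODEBOOK = {
    "the": ("m_001", ["C4"], ["e"]),
    "red": ("m_014", ["A4", "C5"], ["e"]),
    "bird": ("m_027", ["G4", "E4"], ["i"]),
}
NOTE_SECONDS = 0.2      # assumed fixed note duration
WORD_GAP_SECONDS = 0.1  # assumed silence between words


def build_record(text, audio_name, split="test"):
    """Build one metadata record with exact, generator-known timing labels."""
    words = []
    cursor = 0.0
    for word in text.split():
        motif_id, notes, vowels = CODEBOOK[word]
        duration = NOTE_SECONDS * len(notes)
        words.append({
            "word": word,
            "motif_id": motif_id,
            "notes": list(notes),
            "vowels": list(vowels),
            "start": round(cursor, 3),
            "end": round(cursor + duration, 3),
        })
        cursor += duration + WORD_GAP_SECONDS
    return {
        "audio": audio_name,
        "text": text,
        "words": words,
        "codec_version": "loom-v0",
        "split": split,
        "difficulty": "toy",
        "collision_group": None,
    }


record = build_record("the red bird", "sample_0001.wav")
```

With these placeholder timings the three words land at 0.0–0.2 s, 0.3–0.7 s, and 0.8–1.2 s. The point is only that every label comes from the generator and needs no manual alignment.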

Baseline ladder

I would probably avoid jumping straight from the current pitch-only baseline to a large ASR model.

A more informative ladder might be:

Baseline	What it tells you
current FFT pitch-only greedy decoder	floor baseline
pitch-only + dynamic programming	removes some greedy parsing weakness
pYIN/aubio pitch + DP decoder	better note tracking without full ASR
pitch + vowel-aware decoder	tests whether vowel/formant information helps
oracle note sequence → text	isolates codec/parser ambiguity
oracle note+vowel sequence → text	estimates upper bound from symbolic information
small CTC model	first neural baseline
Wav2Vec2/Whisper-like ASR fine-tuning	general ASR transfer baseline
SVT-style joint decoder	future structured model: audio → motifs + text

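This gives more diagnostic value than a single score, because each step answers a different question. For example, if the oracle note sequence → text step still makes errors, the problem is codec or parser ambiguity rather than pitch tracking.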
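
To show what "removes some greedy parsing weakness" could look like, here is a small sketch that parses a clean note sequence into words with a greedy longest-match decoder and with dynamic programming. The codebook is invented, and the DP objective (fewest words among valid parses) is my assumption. A decoder working from a real pitch tracker would need a cost-based version that tolerates note errors.

```python
# Hypothetical motif -> word codebook, chosen so that greedy parsing fails.
CODEBOOK = {
    ("A4",): "a",
    ("A4", "C5"): "red",
    ("C5", "E5"): "sky",
}
MAX_MOTIF_LEN = 2


def greedy_decode(notes, codebook=CODEBOOK, max_len=MAX_MOTIF_LEN):
    """Always take the longest matching motif; return None on a dead end."""
    words, i = [], 0
    while i < len(notes):
        for n in range(min(max_len, len(notes) - i), 0, -1):
            word = codebook.get(tuple(notes[i:i + n]))
            if word is not None:
                words.append(word)
                i += n
                break
        else:
            return None  # no motif matches at position i
    return words


def dp_decode(notes, codebook=CODEBOOK, max_len=MAX_MOTIF_LEN):
    """Find a full parse of the note sequence, preferring fewer words."""
    best = [None] * (len(notes) + 1)  # best[i] = parse of notes[:i]
    best[0] = []
    for end in range(1, len(notes) + 1):
        for n in range(1, min(max_len, end) + 1):
            prefix = best[end - n]
            word = codebook.get(tuple(notes[end - n:end]))
            if prefix is None or word is None:
                continue
            candidate = prefix + [word]
            if best[end] is None or len(candidate) < len(best[end]):
                best[end] = candidate
    return best[len(notes)]


notes = ["A4", "C5", "E5"]
greedy_result = greedy_decode(notes)  # commits to "red", then gets stuck on E5
dp_result = dp_decode(notes)          # finds "a" + "sky"
```

On this input the greedy decoder commits to the two-note motif and dead-ends, while the DP decoder recovers the only complete parse. That gap is what the second rung of the ladder would measure.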

Possible benchmark variants

To keep the idea clean, I would separate variants rather than mixing all goals into one task.

Variant	Goal
toy	current fun/demo version; ambiguity allowed
injective	collision-free / prefix-safe version for stricter codec testing
noisy	compression, additive noise, reverb, resampling, time stretch
style	timbre, vowel, vibrato, voice/singer variation
physical	speaker-to-microphone playback
open-vocab	held-out words or motifs
human	learnability and human decoding experiments

The injective variant would be useful if the question is “can this be a lossless audio code?” The toy variant is still useful if the question is “how far can decoders get under ambiguity?” The noisy/style variants are useful if the question becomes closer to ASR/SVT robustness.
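
For the injective variant, a codebook audit could start with two simple checks: exact collisions (two words sharing one motif) and prefix conflicts (one word's motif being the start of another's). A sketch with an invented codebook follows. Note that being prefix-free is sufficient for unique decodability but not necessary, and explicit word gaps would make prefix conflicts less of a problem.

```python
from collections import defaultdict


def collision_groups(word_to_motif):
    """Group words that map to exactly the same note sequence."""
    groups = defaultdict(list)
    for word, motif in word_to_motif.items():
        groups[tuple(motif)].append(word)
    return {motif: sorted(words) for motif, words in groups.items() if len(words) > 1}


def prefix_conflicts(word_to_motif):
    """List (short_word, long_word) pairs where one motif is a strict prefix."""
    items = [(word, tuple(motif)) for word, motif in word_to_motif.items()]
    conflicts = []
    for short_word, short in items:
        for long_word, long_ in items:
            if len(short) < len(long_) and long_[:len(short)] == short:
                conflicts.append((short_word, long_word))
    return sorted(conflicts)


# Hypothetical codebook with one collision and two prefix conflicts.
codebook = {
    "a": ["A4"],
    "red": ["A4", "C5"],
    "read": ["A4", "C5"],
    "sky": ["C5", "E5"],
}
collisions = collision_groups(codebook)
conflicts = prefix_conflicts(codebook)
```

The output of `collision_groups` is one way to fill a `collision_group` field in the metadata, so the toy variant can report ambiguity instead of hiding it.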

Data-over-sound is another useful nearby field

SVT is useful for the singing/transcription side, but the codec side also has nearby prior art.

Reference	Relevance
ggwave	Data-over-sound library using FSK-style protocols and error correction. Useful for framing, sync, payload, and robustness ideas.
minimodem	Software audio FSK modem. Useful as a classic audio-modem reference.
Quiet	Data-over-sound library built around modem-style signal processing.
Morse code timing	Useful reminder that spacing and boundary timing are part of the code, not just dots/dashes.

This suggests another layer of knobs:

symbol duration
note gap
motif gap
word gap
start/end markers
checksum
error correction
payload rate
collision report

That does not mean this should become a modem. It just means the modem world has useful engineering vocabulary for controlled acoustic codes.
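
As a small illustration of the start/end marker and checksum knobs, here is a toy framing sketch. Everything in it is an assumption: the marker tokens, the eight-note alphabet, and the one-symbol modular checksum. It is not how ggwave, minimodem, or Quiet work internally, and it only detects errors without correcting them.

```python
START, END = "<s>", "</s>"  # hypothetical marker symbols
ALPHABET = ["C4", "D4", "E4", "F4", "G4", "A4", "B4", "C5"]  # assumed symbol set


def checksum(symbols):
    """One check symbol: sum of alphabet indices modulo the alphabet size."""
    total = sum(ALPHABET.index(symbol) for symbol in symbols)
    return ALPHABET[total % len(ALPHABET)]


def frame(payload):
    """Wrap payload symbols with markers and a trailing check symbol."""
    return [START, *payload, checksum(payload), END]


def unframe(symbols):
    """Validate markers and checksum, then return the payload symbols."""
    if len(symbols) < 3 or symbols[0] != START or symbols[-1] != END:
        raise ValueError("missing start/end marker")
    *payload, check = symbols[1:-1]
    if checksum(payload) != check:
        raise ValueError("checksum mismatch")
    return payload


framed = frame(["A4", "C5", "E4"])
recovered = unframe(framed)
```

A single check symbol like this catches any one-symbol substitution, since one changed index always changes the sum modulo the alphabet size. It cannot say which symbol was wrong, which is where real error correction would come in.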

The “language” question

I would keep the language question, but separate it into layers:

Layer	Question
recoverability	can a decoder recover the source text?
unique decodability	is the audio code unambiguous in principle?
robustness	does it survive noise, timing changes, compression, etc.?
learnability	can humans learn it without memorizing arbitrary pairs?
convention	could multiple users share it consistently?
productivity	can it support new utterances systematically?
language-like behavior	does it go beyond codebook lookup?

Nearby conceptual references:

Area	Why it helps
Solresol	Musical constructed language using solfège-like syllables/notes. Useful historical neighbor.
Whistled language	Natural languages transmitted through a reduced acoustic channel. Useful analogy for language under acoustic compression.
Speech surrogates	Broad category including whistled/drummed/instrumental systems that transfer linguistic information into another acoustic medium.
Bora drummed language	Example where rhythm/tone carry linguistic information in a reduced acoustic channel.

So I would phrase the language angle carefully:

High decoding accuracy would be interesting, but it would mostly show recoverability under the chosen code and evaluation setup. Human learnability, convention, and productivity are separate next questions.

That keeps the door open without overclaiming.

A low-friction path

If I were trying to grow this with existing pieces, I would probably do:

1. Keep the current toy benchmark reproducible
   - one command to generate
   - one command to decode
   - one command to score
2. Add structured metadata
   - text
   - words
   - motifs
   - note timing
   - vowel pattern
   - codec version
   - split
   - difficulty
3. Add two scoring tracks
   - mir_eval-style note/motif scoring
   - jiwer-style text scoring
4. Add an alignment track
   - word↔motif correctness
   - timing error
5. Add a baseline ladder
   - pitch-only
   - pitch + DP
   - pitch + vowel
   - oracle notes
   - small neural baseline
   - joint structured model later
6. Split variants
   - toy
   - injective
   - noisy
   - style
   - physical
7. Only then try heavier models
   - CTC
   - Wav2Vec2
   - Whisper-like transfer
   - SVT-style joint structured decoding
   - audio-token / latent alignment approaches

Why this is interesting

The strongest part, to me, is that this creates a small deterministic audio world where several normally tangled problems can be tested separately:

codec design
acoustic decoding
motif recovery
text recovery
vowel cues
alignment
noise robustness
human learnability
code/language boundary

Real SVT datasets have annotation cost, singer variability, accompaniment, noisy alignment, and style differences. This toy system can start with none of that, then add complexity deliberately.

So the most useful framing may be:

a controllable synthetic SVT/audio-code benchmark, with perfect labels and gradually adjustable ambiguity.

That seems like a nice place to test simple decoders first, then stronger ASR/SVT models later.