{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreih52g3ft6ul4oianentiwew3zsytj4cqcjj77tict3zoe4rzffklm",
"uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3moosi47o5sn2"
},
"path": "/t/a-10kb-page-that-sings-text-can-you-decode-it-back-pitch-only-baseline-already-hits-43/176840#post_6",
"publishedAt": "2026-06-20T01:03:19.000Z",
"site": "https://discuss.huggingface.co",
"tags": [
"VocalParse",
"GitHub",
"paper",
"SongTrans",
"DALI",
"SVT_SpeechBrain",
"VOCANO",
"JamendoLyrics",
"Jam-ALT",
"GTSinger",
"STARS",
"mir_eval.transcription",
"mir_eval.alignment",
"jiwer",
"librosa.pyin",
"aubio",
"CREPE",
"AudioFolder",
"Lhotse",
"ggwave",
"minimodem",
"Quiet",
"Morse code timing",
"Solresol",
"Whistled language",
"Speech surrogates",
"Bora drummed language"
],
"textContent": "This looked interesting, so I checked whether there are reusable pieces around it:\n\n* * *\n\n## Short answer\n\nYes — I think this is very plausibly decodable and developable, but I would not start by framing it as only “ASR” or only “a language question”.\n\nA useful nearby frame is:\n\n> **a tiny deterministic singing-code / synthetic Singing Voice Transcription sandbox**\n\nIn other words, this can be separated into several layers:\n\n\n audio → notes / motifs\n audio or motifs → text\n word ↔ motif alignment\n\n\nThat separation seems useful because nearby fields already have reusable tools, metrics, datasets, and failure-mode maps.\n\nThe interesting part is that this project has something real singing datasets usually do not have: **a generator that can produce perfect labels** for audio, text, notes/motifs, timing, vowels, and word↔motif alignment.\n\nSo before jumping to a large neural ASR model, I would probably grow it in this order:\n\n\n toy demo\n → reproducible benchmark\n → structured metadata\n → note/motif scoring\n → text scoring\n → word↔motif alignment scoring\n → stronger non-ML baselines\n → toy / injective / noisy / style variants\n → CTC / Wav2Vec2 / SVT-style joint models\n → human-learnability / code-language boundary questions\n\n\n## Why Singing Voice Transcription seems like a useful nearby frame\n\nA particularly useful neighboring area is **Singing Voice Transcription** , especially newer work that tries to unify lyric transcription, note transcription, and lyric-note alignment.\n\nIn ordinary speech ASR, the target is mostly:\n\n\n audio → text\n\n\nBut in singing transcription, the target is often closer to:\n\n\n audio → lyrics\n audio → notes / melody\n lyrics ↔ notes alignment\n\n\nThat maps surprisingly well onto this project:\n\nSVT / singing transcription layer | This project’s analogue\n---|---\nlyrics transcription | audio → original text\nnote transcription | audio → note/motif sequence\nlyric-note alignment | word ↔ motif alignment\nphoneme/vowel alignment | vowel/formant cues inside motifs\nstyle / singer variation | later style/noisy/timbre variants\nout-of-distribution singing | later robustness tests\n\nSo I would treat this less as “general ASR” at first, and more as a controlled synthetic version of an SVT problem.\n\nSome nearby references:\n\nReference | Why it seems relevant\n---|---\nVocalParse / GitHub / paper | A unified SVT model that outputs a structured autoregressive token sequence encoding lyrics, pitch, note values, and BPM. Useful as a future “joint structured output” direction.\nSongTrans | Transcribes lyrics and notes while aligning them. The “word ↔ note” problem is very close to “word ↔ motif” here.\nDALI / paper / GitHub | Dataset with synchronized audio, lyrics, and vocal melody notes. Lyrics are represented at note, word, line, and paragraph granularity. Very useful for metadata design.\nSVT_SpeechBrain | Separates automatic lyric transcription and automatic music transcription for singing voice. Useful for the ALT/AMT split.\nVOCANO | Singing voice → MIDI/note transcription. Useful for the audio→note/motif layer.\nJamendoLyrics / Jam-ALT | Time-aligned lyrics and a benchmark perspective on readable lyric transcription, formatting, non-word sounds, and normalization.\nGTSinger | Modern singing corpus with phoneme-level annotations, alignments, style labels, and multilingual singing data. Useful for future style/phoneme/vowel variants.\nSTARS | Unified singing transcription, alignment, and refined style annotation. Useful for thinking beyond text-only decoding.\n\n## A practical benchmark layout\n\nI would split the benchmark into at least three measurable tasks.\n\nLayer | Task | Possible metric family\n---|---|---\n**Note/motif recovery** | recover pitch events, notes, or motif sequence from audio | MIR-style note transcription metrics, motif accuracy\n**Text recovery** | recover original word sequence | word accuracy, WER, CER\n**Alignment recovery** | recover which word corresponds to which motif and when | alignment accuracy, timing error\n**Robustness** | recover the same labels under perturbations | score degradation under noise/compression/reverb/etc.\n\nUseful existing tooling:\n\nTool | Use\n---|---\nmir_eval.transcription | Note onset / offset / pitch style evaluation. Useful for the note/motif layer.\nmir_eval.alignment | Event timestamp alignment evaluation. Useful for word↔motif or motif timing checks.\njiwer | WER, CER, MER, WIL, WIP. Useful for the text layer.\nlibrosa.pyin | F0 estimation baseline beyond simple FFT.\naubio | Onset and pitch detection tools. Useful for lightweight note-event baselines.\nCREPE | Neural pitch tracking baseline.\nAudioFolder | Simple Hugging Face packaging route for audio + metadata.\nLhotse | More structured speech/audio manifest tooling if the benchmark grows.\n\n## Suggested metadata\n\nInstead of only storing `audio,text`, it may be useful to store the generated structure.\n\nSomething like:\n\n\n {\n \"audio\": \"sample_0001.wav\",\n \"text\": \"the red bird\",\n \"words\": [\n {\n \"word\": \"red\",\n \"motif_id\": \"m_014\",\n \"notes\": [\"A4\", \"C5\"],\n \"vowels\": [\"e\"],\n \"start\": 0.42,\n \"end\": 0.81\n }\n ],\n \"codec_version\": \"loom-v0\",\n \"split\": \"test\",\n \"difficulty\": \"toy\",\n \"collision_group\": null\n }\n\n\nThis would make the dataset usable from several angles:\n\nUser wants to test… | They can use…\n---|---\ntext decoding | `audio`, `text`\npitch/motif recovery | `audio`, `notes`, `motif_id`\nvowel-aware decoding | `audio`, `vowels`, `word`\nalignment | `start`, `end`, `word`, `motif_id`\ncodec audit | `motif_id`, `collision_group`, `codec_version`\nrobustness | same labels under noisy variants\n\nThe DALI dataset is a useful mental model here: it stores lyrics and vocal notes with time alignment and multiple lyric granularities. For this project, the analogous hierarchy could be:\n\n\n note → motif → word → line/sample\n\n\n## Baseline ladder\n\nI would probably avoid jumping straight from the current pitch-only baseline to a large ASR model.\n\nA more informative ladder might be:\n\nBaseline | What it tells you\n---|---\ncurrent FFT pitch-only greedy decoder | floor baseline\npitch-only + dynamic programming | removes some greedy parsing weakness\npYIN/aubio pitch + DP decoder | better note tracking without full ASR\npitch + vowel-aware decoder | tests whether vowel/formant information helps\noracle note sequence → text | isolates codec/parser ambiguity\noracle note+vowel sequence → text | estimates upper bound from symbolic information\nsmall CTC model | first neural baseline\nWav2Vec2/Whisper-like ASR fine-tuning | general ASR transfer baseline\nSVT-style joint decoder | future structured model: audio → motifs + text\n\nThis gives more diagnostic value than a single score, because each step answers a different question.\n\n## Possible benchmark variants\n\nTo keep the idea clean, I would separate variants rather than mixing all goals into one task.\n\nVariant | Goal\n---|---\n**toy** | current fun/demo version; ambiguity allowed\n**injective** | collision-free / prefix-safe version for stricter codec testing\n**noisy** | compression, additive noise, reverb, resampling, time stretch\n**style** | timbre, vowel, vibrato, voice/singer variation\n**physical** | speaker-to-microphone playback\n**open-vocab** | held-out words or motifs\n**human** | learnability and human decoding experiments\n\nThe **injective** variant would be useful if the question is “can this be a lossless audio code?”\nThe **toy** variant is still useful if the question is “how far can decoders get under ambiguity?”\nThe **noisy/style** variants are useful if the question becomes closer to ASR/SVT robustness.\n\n## Data-over-sound is another useful nearby field\n\nSVT is useful for the singing/transcription side, but the codec side also has nearby prior art.\n\nReference | Relevance\n---|---\nggwave | Data-over-sound library using FSK-style protocols and error correction. Useful for framing, sync, payload, and robustness ideas.\nminimodem | Software audio FSK modem. Useful as a classic audio-modem reference.\nQuiet | Data-over-sound library built around modem-style signal processing.\nMorse code timing | Useful reminder that spacing and boundary timing are part of the code, not just dots/dashes.\n\nThis suggests another layer of knobs:\n\n\n symbol duration\n note gap\n motif gap\n word gap\n start/end markers\n checksum\n error correction\n payload rate\n collision report\n\n\nThat does not mean this should become a modem. It just means the modem world has useful engineering vocabulary for controlled acoustic codes.\n\n## The “language” question\n\nI would keep the language question, but separate it into layers:\n\nLayer | Question\n---|---\nrecoverability | can a decoder recover the source text?\nunique decodability | is the audio code unambiguous in principle?\nrobustness | does it survive noise, timing changes, compression, etc.?\nlearnability | can humans learn it without memorizing arbitrary pairs?\nconvention | could multiple users share it consistently?\nproductivity | can it support new utterances systematically?\nlanguage-like behavior | does it go beyond codebook lookup?\n\nNearby conceptual references:\n\nArea | Why it helps\n---|---\nSolresol | Musical constructed language using solfège-like syllables/notes. Useful historical neighbor.\nWhistled language | Natural languages transmitted through a reduced acoustic channel. Useful analogy for language under acoustic compression.\nSpeech surrogates | Broad category including whistled/drummed/instrumental systems that transfer linguistic information into another acoustic medium.\nBora drummed language | Example where rhythm/tone carry linguistic information in a reduced acoustic channel.\n\nSo I would phrase the language angle carefully:\n\n> High decoding accuracy would be interesting, but it would mostly show recoverability under the chosen code and evaluation setup. Human learnability, convention, and productivity are separate next questions.\n\nThat keeps the door open without overclaiming.\n\n## A low-friction path\n\nIf I were trying to grow this with existing pieces, I would probably do:\n\n 1. **Keep the current toy benchmark reproducible**\n\n * one command to generate\n * one command to decode\n * one command to score\n 2. **Add structured metadata**\n\n * text\n * words\n * motifs\n * note timing\n * vowel pattern\n * codec version\n * split\n * difficulty\n 3. **Add two scoring tracks**\n\n * `mir_eval`-style note/motif scoring\n * `jiwer`-style text scoring\n 4. **Add an alignment track**\n\n * word↔motif correctness\n * timing error\n 5. **Add a baseline ladder**\n\n * pitch-only\n * pitch + DP\n * pitch + vowel\n * oracle notes\n * small neural baseline\n * joint structured model later\n 6. **Split variants**\n\n * toy\n * injective\n * noisy\n * style\n * physical\n 7. **Only then try heavier models**\n\n * CTC\n * Wav2Vec2\n * Whisper-like transfer\n * SVT-style joint structured decoding\n * audio-token / latent alignment approaches\n\n\n\n## Why this is interesting\n\nThe strongest part, to me, is that this creates a small deterministic audio world where several normally tangled problems can be tested separately:\n\n\n codec design\n acoustic decoding\n motif recovery\n text recovery\n vowel cues\n alignment\n noise robustness\n human learnability\n code/language boundary\n\n\nReal SVT datasets have annotation cost, singer variability, accompaniment, noisy alignment, and style differences. This toy system can start with none of that, then add complexity deliberately.\n\nSo the most useful framing may be:\n\n> a controllable synthetic SVT/audio-code benchmark, with perfect labels and gradually adjustable ambiguity.\n\nThat seems like a nice place to test simple decoders first, then stronger ASR/SVT models later.",
"title": "A ~10KB page that sings text — can you decode it back? (pitch-only baseline already hits 43%)"
}