Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreicsny75dorueukjbjqu2dhyzdwokxqufub7wft4n4nek5nkxlrtj4",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mnyrgqa4k5s2"
  },
  "path": "/t/rnn-in-c-is-this-bptt-finally-right/176455#post_7",
  "publishedAt": "2026-06-11T07:26:39.000Z",
  "site": "https://discuss.huggingface.co",
  "textContent": "Hi. Hmm, I probably can’t identify the exact issue from here, but I tried to look at it from the bigger training-loop picture — basically, “if this piece were missing, would the symptoms look like this?”:\n\n* * *\n\nMy rough read is: you may already be past the “is BPTT basically alive?” wall, and may now be hitting the “what exactly am I measuring?” wall.\n\nI would not read this as a 256-vs-512 question first. I would map it more like this:\n\nLayer | Question | My guess from your description\n---|---|---\nRNN cell | Does the recurrence run at all? | probably yes\nBPTT skeleton | Do gradients flow backward through time in the right shape? | probably mostly yes\nOutput gradient | Is the softmax/NLL gradient entering as `p - target`? | recently clarified\nLoss/reporting | Is the printed number actually target NLL? | suspicious\nEval loop | Is there a held-out NLL/perplexity number? | unclear\nStability | Are gradient scale, clipping, and LR under control? | next likely wall\nCapacity | 256 vs 512, one layer vs two, vanilla vs gated cell | judge after the above\nDeployment | memory, quantization, weight layout | later problem\n\nThe reason I would look at reporting first is this line:\n\n\n    loss -= fmax(0.0, netout[iter][i]) * std::log(fmax(1e-18f, netout[iter][i]));\n\n\nIf `netout[iter][i]` is the model probability for character `i`, then this looks like output entropy:\n\n\n    /*\n     * Entropy of the model's own output distribution.\n     *\n     * This measures how spread out or uncertain the model's prediction is.\n     * It does not look at the actual next character.\n     */\n    entropy = 0.0f;\n\n    for (i = 0; i < Y_S; i++) {\n        entropy -= p[i] * logf(p[i] + 1e-12f);\n    }\n\n\nThat is a useful number sometimes, but it is not the usual next-character language-model loss.\n\nFor next-character prediction, the usual question is much narrower:\n\n> how much probability did the model assign to the actual next character?\n\nThat is target negative log-likelihood:\n\n\n    /*\n     * Next-character negative log-likelihood.\n     *\n     * This number looks at the target.\n     * It asks:\n     *\n     *     how much probability did the model give to the real next char?\n     */\n    nll = -logf(p[target] + 1e-12f);\n\n\nSo I would separate these two questions:\n\n  1. Is the model learning anything?\n  2. Is the printed number measuring next-character prediction?\n\n\n\nThose are not the same question.\n\nA model can have low entropy and still be confidently wrong.\nA model can have higher entropy and still put more probability on the correct next character.\n\nFor the output layer, the shape I would expect is still:\n\n\n    /*\n     * z = raw logits\n     * p = softmax(z)\n     *\n     * For softmax + target NLL, the gradient at the logits is:\n     *\n     *     dz = p - one_hot(target)\n     */\n    for (i = 0; i < Y_S; i++) {\n        dz[i] = p[i];\n    }\n\n    dz[target] -= 1.0f;\n\n\nThen the reported training number should usually come from the target probability, not from the whole output distribution’s entropy:\n\n\n    /*\n     * Accumulate target NLL over characters.\n     */\n    total_nll += -logf(p[target] + 1e-12f);\n    count++;\n\n    mean_nll = total_nll / (float)count;\n\n\nIf that `mean_nll` is in natural-log units, then perplexity is:\n\n\n    /*\n     * Per-character perplexity, if mean_nll is in nats.\n     *\n     * mean_nll = 1.00  ->  ppl ~= exp(1.00) ~= 2.72\n     * mean_nll = 0.94  ->  ppl ~= exp(0.94) ~= 2.56\n     */\n    ppl = expf(mean_nll);\n\n\nThat distinction matters because “stuck around 1.0” means very different things depending on what the number is.\n\nPrinted number | Meaning\n---|---\n`-sum_i p[i] * log(p[i])` | entropy of the model’s own prediction\n`-log(p[target])` | next-character NLL\n`exp(mean_nll)` | perplexity, if `mean_nll` is true target NLL\ntrain NLL only | can show fitting, but not generalization\nheld-out NLL | better signal for whether the model is really improving\n\nSo before changing the architecture, I would probably make the measurement path boring and explicit:\n\n\n    /*\n     * Minimal reporting loop idea.\n     *\n     * Same model.\n     * Same forward pass.\n     * But print different numbers clearly.\n     */\n    train_nll = 0.0f;\n    train_entropy = 0.0f;\n    train_count = 0;\n\n    for each training character {\n        forward();\n\n        /*\n         * The number used as the LM loss.\n         */\n        train_nll += -logf(p[target] + 1e-12f);\n\n        /*\n         * Optional diagnostic only.\n         */\n        for (i = 0; i < Y_S; i++) {\n            train_entropy -= p[i] * logf(p[i] + 1e-12f);\n        }\n\n        train_count++;\n    }\n\n    mean_train_nll = train_nll / (float)train_count;\n    mean_entropy   = train_entropy / (float)train_count;\n    train_ppl      = expf(mean_train_nll);\n\n\nAnd then do the same thing on text not used for weight updates:\n\n\n    /*\n     * Evaluation pass.\n     *\n     * No weight update.\n     * No gradient accumulation needed.\n     * Just forward pass + target NLL.\n     */\n    eval_nll = 0.0f;\n    eval_count = 0;\n\n    for each held_out character {\n        forward();\n\n        eval_nll += -logf(p[target] + 1e-12f);\n        eval_count++;\n    }\n\n    mean_eval_nll = eval_nll / (float)eval_count;\n    eval_ppl      = expf(mean_eval_nll);\n\n\nOnly after that would I trust the curve enough to ask whether the bottleneck is capacity.\n\nIf true held-out NLL is improving and then flattening, then yes, hidden size, gated cells, more data, better corpus separation, or more layers become real questions.\n\nIf the printed number is entropy, then the model might be improving or failing and the number may not tell you which.\n\nThe next wall after measurement is probably gradient scale. Even when the equations are right, vanilla RNN training can be touchy. I would want to know at least whether the gradient norm is exploding or getting clipped:\n\n\n    /*\n     * Separate diagnostic:\n     *\n     * Are the gradients huge?\n     * Are they tiny?\n     * Is the LR fighting the recurrence?\n     */\n    grad_norm = sqrtf(sum_of_squared_gradients);\n\n    if (grad_norm > clip_norm) {\n        scale = clip_norm / (grad_norm + 1e-12f);\n\n        /*\n         * Multiply every gradient buffer by scale.\n         */\n        scale_all_gradients(scale);\n    }\n\n\nSo my order of suspicion would be:\n\nPriority | Check | Why\n---|---|---\n1 | Is the printed loss target NLL? | otherwise the main number may not measure prediction\n2 | Is there held-out NLL/PPL? | otherwise “stuck” is hard to interpret\n3 | Is softmax/log loss numerically stable? | `expf` and `logf` can lie quietly\n4 | Are gradients clipped or at least measured? | vanilla RNNs are sensitive here\n5 | Is LR/optimizer state stable? | manual LR changes can hide other issues\n6 | Is the hidden state boundary correct? | BPTT length and reset/carry choices matter\n7 | Only then: 256 vs 512 / layers / gated cell | capacity is meaningful after the instruments are trustworthy\n\nOn the embedding question: for a 96-character one-hot input, I would not treat “no embedding layer” as the first-order problem. A matrix multiply by a one-hot vector is already acting like selecting one column/row of learned input weights. An explicit embedding layer may be cleaner or more efficient, but I would not expect it to magically fix a bad loss/reporting path.\n\nSo my short version would be:\n\n\n    /*\n     * My guess:\n     *\n     * recurrent part: probably alive\n     * output gradient: probably close now\n     * printed loss: check very carefully\n     * architecture: not the first suspect yet\n     */\n\n\nIf the reported `0.94` is true target NLL in nats/char, that is one kind of result.\n\nIf the reported `0.94` is output entropy, I would not use it yet to decide whether the model needs 512 nodes, another layer, or an embedding layer.\n\nI would first make the instruments say exactly:\n\n\n    mean_train_nll\n    mean_eval_nll\n    train_ppl\n    eval_ppl\n    mean_output_entropy      /* optional diagnostic */\n    grad_norm               /* optional diagnostic */\n\n\nThen the architecture question becomes much less murky.",
  "title": "RNN in C - is this BPTT finally right?"
}