Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreifbaguucdooifaxdjrabjqvrvxnvta3i4iimwrw7k54otq3dlnnyq",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mnbhljaem362"
  },
  "path": "/t/rnn-in-c-is-this-bptt-finally-right/176455#post_2",
  "publishedAt": "2026-06-02T01:35:44.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "min-char-rnn.py",
    "CS231n’s softmax notes",
    "Numerically Stable Softmax and Cross Entropy",
    "CS231n: Training a Softmax Linear Classifier",
    "Dive into Deep Learning: Backpropagation Through Time",
    "D2L BPTT",
    "CrossEntropyLoss",
    "gradcheck",
    "taey16/min_char_rnn",
    "Andrej Karpathy — min-char-rnn.py",
    "CS231n — Softmax classifier",
    "CS231n — softmax gradient in the neural-net case study",
    "Jay Mody — Numerically Stable Softmax and Cross Entropy",
    "Dive into Deep Learning — Backpropagation Through Time",
    "PyTorch — CrossEntropyLoss",
    "PyTorch — gradcheck",
    "taey16 — min_char_rnn with gradient checking"
  ],
  "textContent": "Hmm, maybe something like this…?\n\n* * *\n\nI think the BPTT part is _close_ , but I would not call this “finally right” yet. The main issue I see is not really the recurrent part of BPTT; it is the output layer / softmax / cross-entropy path.\n\nThe short version:\n\nPart | My read\n---|---\nHidden-state recurrence | Looks structurally reasonable\n`dh = Why^T * dy + dh_next` | Correct idea\n`dh_raw = (1 - h^2) * dh` | Correct for `tanh`\n`dWxh += dh_raw * x[t]` | Correct idea\n`dWhh += dh_raw * h[t]` | Correct if your indexing is `h[t+1] = f(x[t], h[t])`\n`dh_next = Whh^T * dh_raw` | Correct idea\nSoftmax forward | Looks wrong\nCross-entropy gradient | Looks wrong because the softmax value is wrong\nFinal confidence | Needs a small gradient check\n\nThe biggest red flag is that `y` appears to be used as multiple different things: logits, exponentiated logits, and something like a softmax output. I would split it into two arrays:\n\n\n    float z[T][Y_S];  // logits, raw output scores\n    float p[T][Y_S];  // softmax probabilities\n\n\nThat one naming change tends to eliminate a lot of bugs.\n\n## 1. The core bug: logits, exp(logits), and probabilities are mixed\n\nThis part looks suspicious:\n\n\n    y[t][i] = exp(sum);\n    ssum += sum;\n\n\nFor softmax, if `sum` is the logit, the denominator should accumulate `exp(sum)`, not `sum`.\n\nThe usual softmax is:\n\n\n    p_i = exp(z_i) / sum_j exp(z_j)\n\n\nSo if you store `exp(sum)` in `y[t][i]`, then the denominator would need to sum those exponentials. But then later this happens:\n\n\n    softmax_out = expf(y[t][i]) / softsum[t];\n\n\nThat means you are exponentiating `y[t][i]` again. If `y[t][i]` already contains `exp(logit)`, this becomes something like:\n\n\n    exp(exp(logit))\n\n\nThat is not softmax.\n\nA cleaner implementation is:\n\n\n    z[t][i] = sum;        // raw logit\n    p[t][i] = softmax(z); // probability\n\n\nThis is also the convention used in Andrej Karpathy’s minimal character RNN. In min-char-rnn.py, the forward pass separates:\n\n\n    ys[t] = np.dot(Why, hs[t]) + by\n    ps[t] = np.exp(ys[t]) / np.sum(np.exp(ys[t]))\n    loss += -np.log(ps[t][targets[t], 0])\n\n\nHere:\n\nKarpathy variable | Meaning\n---|---\n`ys[t]` | logits / unnormalized scores\n`ps[t]` | softmax probabilities\n`targets[t]` | correct class index\n\nThat separation is very useful. I would copy that idea directly into the C version.\n\n## 2. Use a numerically stable softmax\n\nEven after fixing the logic, I would not write the naive softmax in C. Use the standard max-subtraction trick.\n\nBoth CS231n’s softmax notes and Jay Mody’s Numerically Stable Softmax and Cross Entropy explain the same point: exponentials can overflow, and subtracting the maximum logit gives an equivalent but much safer computation.\n\nUse something like this:\n\n\n    // z[t][i] already contains the logits for time step t.\n    // p[t][i] will contain the probabilities.\n\n    float max_z = z[t][0];\n\n    for (int i = 1; i < Y_S; i++) {\n        if (z[t][i] > max_z) {\n            max_z = z[t][i];\n        }\n    }\n\n    float denom = 0.0f;\n\n    for (int i = 0; i < Y_S; i++) {\n        p[t][i] = expf(z[t][i] - max_z);\n        denom += p[t][i];\n    }\n\n    for (int i = 0; i < Y_S; i++) {\n        p[t][i] /= denom;\n    }\n\n\nThis has a few advantages:\n\nProblem | Why this helps\n---|---\nOverflow in `expf` | The largest exponent becomes `expf(0) = 1`\nDivision by zero | At least one denominator term is 1\nAmbiguous variable meaning | `z` is logits, `p` is probability\nEasier backward pass | `dy = p - one_hot(target)`\n\nYou can also compute the loss stably with log-sum-exp:\n\n\n    float log_denom = logf(denom) + max_z;\n    loss += -z[t][target[t]] + log_denom;\n\n\nor, if you already computed `p`:\n\n\n    loss += -logf(p[t][target[t]] + 1e-12f);\n\n\nThe log-sum-exp version is cleaner numerically.\n\n## 3. The output gradient should be `p - one_hot(target)`\n\nFor softmax + cross-entropy, the gradient with respect to the logits is the classic:\n\n\n    dy = p;\n    dy[target] -= 1.0f;\n\n\nThis is what Karpathy’s implementation does:\n\n\n    dy = np.copy(ps[t])\n    dy[targets[t]] -= 1\n\n\nSee also the CS231n neural-net case study section on the softmax gradient: CS231n: Training a Softmax Linear Classifier.\n\nIn C, I would write:\n\n\n    float dy[Y_S];\n\n    for (int i = 0; i < Y_S; i++) {\n        dy[i] = p[t][i];\n    }\n\n    dy[target[t]] -= 1.0f;\n\n\nThen the output-layer gradients are:\n\n\n    for (int i = 0; i < Y_S; i++) {\n        dby[i] += dy[i];\n\n        for (int j = 0; j < H_S; j++) {\n            dWhy[i][j] += dy[i] * h[t + 1][j];\n        }\n    }\n\n\nThis assumes your forward recurrence is:\n\n\n    h[t + 1] = tanh(Wxh * x[t] + Whh * h[t] + bh);\n    z[t]     = Why * h[t + 1] + by;\n    p[t]     = softmax(z[t]);\n\n\nThat indexing convention is fine. It just means `h[t]` is the previous state and `h[t+1]` is the current state.\n\n## 4. Your BPTT recurrence looks mostly right\n\nGiven this forward pass:\n\n\n    h[t + 1] = tanh(Wxh * x[t] + Whh * h[t] + bh);\n    z[t]     = Why * h[t + 1] + by;\n    p[t]     = softmax(z[t]);\n\n\nThe backward pass should look like this:\n\n\n    for (int t = T - 1; t >= 0; t--) {\n        // 1. Output gradient\n        for (int i = 0; i < Y_S; i++) {\n            dy[i] = p[t][i];\n        }\n        dy[target[t]] -= 1.0f;\n\n        // 2. Gradients for Why and by\n        for (int i = 0; i < Y_S; i++) {\n            dby[i] += dy[i];\n\n            for (int j = 0; j < H_S; j++) {\n                dWhy[i][j] += dy[i] * h[t + 1][j];\n            }\n        }\n\n        // 3. Backprop into current hidden state\n        for (int i = 0; i < H_S; i++) {\n            float sum = dh_next[i];\n\n            for (int j = 0; j < Y_S; j++) {\n                sum += Why[j][i] * dy[j];\n            }\n\n            dh[i] = sum;\n        }\n\n        // 4. Backprop through tanh\n        for (int i = 0; i < H_S; i++) {\n            dh_raw[i] = (1.0f - h[t + 1][i] * h[t + 1][i]) * dh[i];\n            dbh[i] += dh_raw[i];\n        }\n\n        // 5. Gradients for Wxh and Whh\n        for (int i = 0; i < H_S; i++) {\n            for (int j = 0; j < X_S; j++) {\n                dWxh[i][j] += dh_raw[i] * x[t][j];\n            }\n\n            for (int j = 0; j < H_S; j++) {\n                dWhh[i][j] += dh_raw[i] * h[t][j];\n            }\n        }\n\n        // 6. Propagate hidden gradient to previous time step\n        for (int i = 0; i < H_S; i++) {\n            float sum = 0.0f;\n\n            for (int j = 0; j < H_S; j++) {\n                sum += Whh[j][i] * dh_raw[j];\n            }\n\n            dh_next[i] = sum;\n        }\n    }\n\n\nThis matches the same structure as Karpathy’s minimal RNN:\n\n\n    dWhy += np.dot(dy, hs[t].T)\n    dby += dy\n    dh = np.dot(Why.T, dy) + dhnext\n    dhraw = (1 - hs[t] * hs[t]) * dh\n    dbh += dhraw\n    dWxh += np.dot(dhraw, xs[t].T)\n    dWhh += np.dot(dhraw, hs[t-1].T)\n    dhnext = np.dot(Whh.T, dhraw)\n\n\nThe indexing difference is only this:\n\nConcept | Karpathy | Your indexing\n---|---|---\nprevious hidden | `hs[t-1]` | `h[t]`\ncurrent hidden | `hs[t]` | `h[t+1]`\noutput logits | `ys[t]` | `z[t]`\noutput probabilities | `ps[t]` | `p[t]`\n\nSo the recurrent part looks conceptually right.\n\n## 5. Why the `+=` accumulation is correct\n\nThe `+=` on `dWxh`, `dWhh`, `dWhy`, `dbh`, and `dby` is not just an implementation detail. It is required.\n\nIn BPTT, the RNN is unrolled over time, but the parameters are shared at every time step. Therefore each occurrence contributes to the same parameter gradient. Dive into Deep Learning: Backpropagation Through Time explains this as standard backpropagation on the unrolled computational graph, with gradients summed across all repeated occurrences of the same parameter.\n\nSo these are correct in spirit:\n\n\n    dWhy[i][j] += dy[i] * h[t + 1][j];\n    dWxh[i][j] += dh_raw[i] * x[t][j];\n    dWhh[i][j] += dh_raw[i] * h[t][j];\n    dbh[i]     += dh_raw[i];\n    dby[i]     += dy[i];\n\n\nThe important point is that the values being accumulated must be correct. Right now the softmax path makes `dy` questionable, so it contaminates the rest of the gradients.\n\n## 6. Reset gradient accumulators every update\n\nBefore each sequence/minibatch update, make sure all gradient arrays are zeroed:\n\n\n    memset(dWxh, 0, sizeof dWxh);\n    memset(dWhh, 0, sizeof dWhh);\n    memset(dWhy, 0, sizeof dWhy);\n    memset(dbh,  0, sizeof dbh);\n    memset(dby,  0, sizeof dby);\n\n\nAlso reset `dh_next` before the backward loop:\n\n\n    for (int i = 0; i < H_S; i++) {\n        dh_next[i] = 0.0f;\n    }\n\n\nOtherwise the gradient from a previous backward pass can leak into the current one.\n\n## 7. Add gradient clipping\n\nVanilla RNNs are very prone to exploding gradients. Karpathy’s `min-char-rnn.py` clips every gradient array:\n\n\n    for dparam in [dWxh, dWhh, dWhy, dbh, dby]:\n        np.clip(dparam, -5, 5, out=dparam)\n\n\nD2L also points out that long sequences create both computational and numerical instability, including exploding/vanishing gradients: D2L BPTT.\n\nIn C:\n\n\n    static inline float clipf(float v, float lo, float hi) {\n        if (v < lo) return lo;\n        if (v > hi) return hi;\n        return v;\n    }\n\n\nThen apply it to all gradient arrays before the parameter update:\n\n\n    dWxh[i][j] = clipf(dWxh[i][j], -5.0f, 5.0f);\n    dWhh[i][j] = clipf(dWhh[i][j], -5.0f, 5.0f);\n    dWhy[i][j] = clipf(dWhy[i][j], -5.0f, 5.0f);\n    dbh[i]     = clipf(dbh[i],     -5.0f, 5.0f);\n    dby[i]     = clipf(dby[i],     -5.0f, 5.0f);\n\n\n## 8. Use `CrossEntropyLoss` conventions as a sanity check\n\nA useful mental model is PyTorch’s CrossEntropyLoss. PyTorch expects unnormalized logits as input, not already-softmaxed probabilities. It internally combines log-softmax and negative log-likelihood.\n\nFor a manual C implementation, that means:\n\nStage | Store\n---|---\nOutput affine layer | logits `z[t][i]`\nSoftmax | probabilities `p[t][i]`\nLoss | `-log(p[t][target[t]])` or stable log-sum-exp\nBackward into logits | `dy[i] = p[t][i] - one_hot(target[t])`\n\nDo not do:\n\n\n    prob = softmax(exp(logit));\n\n\nDo:\n\n\n    prob = softmax(logit);\n\n\n## 9. The fastest way to know: gradient check\n\nAfter fixing the softmax path, I would not rely on visual inspection. Add a tiny finite-difference gradient check.\n\nUse a very small model:\n\n\n    #define T   2\n    #define X_S 3\n    #define H_S 4\n    #define Y_S 3\n\n\nThen check a few parameters from each group:\n\n\n    float eps = 1e-3f;\n\n    float old = Wxh[0][0];\n\n    Wxh[0][0] = old + eps;\n    float loss_plus = forward_loss_only();\n\n    Wxh[0][0] = old - eps;\n    float loss_minus = forward_loss_only();\n\n    Wxh[0][0] = old;\n\n    float numerical = (loss_plus - loss_minus) / (2.0f * eps);\n    float analytic = dWxh[0][0];\n\n    float relerr = fabsf(numerical - analytic) /\n                   fmaxf(1e-8f, fabsf(numerical) + fabsf(analytic));\n\n\nPyTorch’s gradcheck is basically the same idea: compare analytical gradients against small finite-difference estimates. There is also a useful Karpathy-style gist with gradient checking added: taey16/min_char_rnn.\n\nI would test:\n\nParameter | What failure suggests\n---|---\n`by` | output gradient / softmax / target indexing bug\n`Why` | output gradient or current hidden indexing bug\n`bh` | tanh backward bug\n`Wxh` | input-to-hidden gradient bug\n`Whh` | recurrent gradient or previous hidden indexing bug\n\nIf `by` and `Why` fail first, the bug is almost certainly in the output layer. If those pass but `Wxh`/`Whh` fail, then look at `dh_next`, `dh_raw`, or the `h[t]` vs `h[t+1]` indexing.\n\n## 10. Minimal corrected structure\n\nPutting the important parts together, the forward pass should be shaped like this:\n\n\n    // h[0] should contain the initial hidden state.\n\n    for (int t = 0; t < T; t++) {\n        // Hidden state\n        for (int i = 0; i < H_S; i++) {\n            float sum = bh[i];\n\n            for (int j = 0; j < X_S; j++) {\n                sum += Wxh[i][j] * x[t][j];\n            }\n\n            for (int j = 0; j < H_S; j++) {\n                sum += Whh[i][j] * h[t][j];\n            }\n\n            h[t + 1][i] = tanhf(sum);\n        }\n\n        // Logits\n        for (int i = 0; i < Y_S; i++) {\n            float sum = by[i];\n\n            for (int j = 0; j < H_S; j++) {\n                sum += Why[i][j] * h[t + 1][j];\n            }\n\n            z[t][i] = sum;\n        }\n\n        // Stable softmax\n        float max_z = z[t][0];\n\n        for (int i = 1; i < Y_S; i++) {\n            if (z[t][i] > max_z) {\n                max_z = z[t][i];\n            }\n        }\n\n        float denom = 0.0f;\n\n        for (int i = 0; i < Y_S; i++) {\n            p[t][i] = expf(z[t][i] - max_z);\n            denom += p[t][i];\n        }\n\n        for (int i = 0; i < Y_S; i++) {\n            p[t][i] /= denom;\n        }\n\n        // Loss\n        loss += -logf(p[t][target[t]] + 1e-12f);\n    }\n\n\nAnd the backward pass should begin from:\n\n\n    for (int i = 0; i < Y_S; i++) {\n        dy[i] = p[t][i];\n    }\n\n    dy[target[t]] -= 1.0f;\n\n\nThen continue with the BPTT recurrence you already mostly have.\n\n## Final verdict\n\nI would phrase it this way:\n\n> The recurrent/BPTT skeleton is mostly right, but the output-layer math is not. Fix the softmax path first. Store raw logits separately from probabilities. Use stable softmax. Use `dy = p - one_hot(target)` for softmax + cross-entropy. Then run a small gradient check before trusting the implementation.\n\nSo, not “finally right” yet, but probably not far off.\n\nUseful references:\n\n  * Andrej Karpathy — min-char-rnn.py\n  * CS231n — Softmax classifier\n  * CS231n — softmax gradient in the neural-net case study\n  * Jay Mody — Numerically Stable Softmax and Cross Entropy\n  * Dive into Deep Learning — Backpropagation Through Time\n  * PyTorch — CrossEntropyLoss\n  * PyTorch — gradcheck\n  * taey16 — min_char_rnn with gradient checking\n\n",
  "title": "RNN in C - is this BPTT finally right?"
}