{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreid6hmwmr4xss7gk5wboio6rysnv6zcils6kvuxxs4whaxvub7mugq",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mo2h4lvx4l62"
  },
  "path": "/t/rnn-in-c-is-this-bptt-finally-right/176455#post_10",
  "publishedAt": "2026-06-11T22:30:58.000Z",
  "site": "https://discuss.huggingface.co",
  "textContent": "Hm. With the added context and code, the LLM view would probably become something like this?:\n\n* * *\n\nI think the training gradient may be closer than the printed loss.\n\nThis part looks like the right idea for the output gradient:\n\n\n    /*\n     * netout is probability after softmax.\n     * Then this turns it into:\n     *\n     *     dz = p - one_hot(target)\n     */\n    netout[twin][tc] -= 1.f;\n\n\nSo after this line, `netout[twin]` is no longer a probability distribution. It is now the output gradient buffer.\n\nThat means this later line is suspicious:\n\n\n    ebuf -= log(fmax(1e-18f, netout[iter][cin[iter]]));\n\n\nThere are two separate issues there.\n\nFirst, `netout[iter]` has already been changed from `p` into `p - target`.\n\nSecond, `cin[iter]` looks like the input character index, not the target character index:\n\n\n    /*\n     * b  = previous char / current input\n     * tc = current read char / correct next-char target\n     */\n    netin[b] = 1.f;\n    cin[twin] = b;\n\n    /*\n     * But the training target for this step is tc.\n     */\n\n\nSo I think the reporting path should probably be separated from the gradient path.\n\nSomething like:\n\n\n    /*\n     * Store the actual next-character target.\n     */\n    target[twin] = tc;\n\n\nThen, immediately after softmax and before mutating `netout`:\n\n\n    /*\n     * netout[twin][i] is still probability here.\n     *\n     * This is the next-character NLL for this step.\n     */\n    step_nll[twin] = -logf(netout[twin][target[twin]] + 1e-12f);\n\n    total_nll += step_nll[twin];\n    nll_count++;\n\n\nOnly after that should `netout` be changed into the output gradient:\n\n\n    /*\n     * Reuse netout as dz if you want.\n     * But after this point, do not use it as probability.\n     */\n    netout[twin][target[twin]] -= 1.0f;\n\n\nOr, more explicit but more memory:\n\n\n    /*\n     * p  = probability\n     * dz = output gradient\n     */\n    for (i = 0; i < 96; i++) {\n        p[twin][i]  = netout[twin][i];\n        dz[twin][i] = netout[twin][i];\n    }\n\n    dz[twin][target[twin]] -= 1.0f;\n\n\nThen BPTT should consume `dz`, not `p`:\n\n\n    /*\n     * Output layer gradient.\n     */\n    dobias[i] += dz[iter][i];\n\n    for (j = 0; j < hidd; j++) {\n        dnno[j][i] += dz[iter][i] * h[iplus][j];\n    }\n\n\nAnd the hidden gradient should also use `dz`:\n\n\n    /*\n     * dh = Why^T * dz + dh_next\n     */\n    sum = dh_next[i];\n\n    for (j = 0; j < 96; j++) {\n        sum += nno[i][j] * dz[iter][j];\n    }\n\n    dh[i] = sum;\n\n\nSo if you want to keep reusing `netout` as `dz`, I would avoid computing loss from `netout` later.\n\nMaybe the minimal change is:\n\n\n    /*\n     * After softmax normalization:\n     *\n     *     netout[twin][i] == p[i]\n     *\n     * So compute loss now.\n     */\n    target[twin] = tc;\n\n    ebuf += -logf(netout[twin][target[twin]] + 1e-12f);\n\n    /*\n     * Now mutate netout into dz for BPTT.\n     */\n    netout[twin][target[twin]] -= 1.0f;\n\n\nAnd remove this from the BPTT loop:\n\n\n    /*\n     * Suspicious:\n     *\n     * netout is no longer probability here,\n     * and cin[iter] appears to be the input char, not the target char.\n     */\n    ebuf -= log(fmax(1e-18f, netout[iter][cin[iter]]));\n\n\nSo my current guess is not “your BPTT is totally wrong”.\n\nIt is more like:\n\n\n    /*\n     * The backward signal may be roughly right:\n     *\n     *     netout = p - target\n     *\n     * But the printed loss seems to be reading from the wrong object:\n     *\n     *     mutated gradient buffer, not probability\n     *\n     * and possibly from the wrong index:\n     *\n     *     input char, not next-char target\n     */\n\n\nThat would also explain why the printed number behaves oddly while the model can still sometimes produce fragments that look meaningful.\n\nThe model may be learning something, but the instrument panel may still be wired to the wrong signal.",
  "title": "RNN in C - is this BPTT finally right?"
}