Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreih2krr66iauh7e2mtqenw6g3ladjnuo343egp7ts7eunzscapcnqm",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mo7bi5vg6sn2"
  },
  "path": "/t/rnn-in-c-is-this-bptt-finally-right/176455#post_12",
  "publishedAt": "2026-06-13T21:56:36.000Z",
  "site": "https://discuss.huggingface.co",
  "textContent": "Nice! For now, I just tried to triage the current symptoms:\n\n* * *\n\nThat sounds like a real step forward.\n\nI would now separate the previous wall from the new wall:\n\nStage | My read\n---|---\nBPTT form | probably mostly settled\noutput gradient | probably mostly settled\nloss/reporting | much clearer now\ncurrent barrier | probably training dynamics / saturation / gradient scale\narchitecture choice | still relevant, but I would not make it the first suspect yet\n\nSo my current guess would change from:\n\n\n    /*\n     * previous suspect:\n     *\n     * are we measuring the right thing?\n     */\n\n\nto:\n\n\n    /*\n     * current suspect:\n     *\n     * is the vanilla tanh recurrence staying in a trainable regime?\n     */\n\n\nThe part that catches my eye is this:\n\n> the hidden layer dissolves from centric behaviour towards either extremity\n\nFor a vanilla `tanh` RNN, that sounds important. If hidden units drift near `+1` or `-1`, then the local derivative gets small:\n\n\n    /*\n     * tanh derivative.\n     *\n     * If h is near +1 or -1,\n     * this factor goes near zero.\n     */\n    dh_raw[i] = (1.0f - h[i] * h[i]) * dh[i];\n\n\nSo even if `dh` exists, the nonlinearity can shrink `dh_raw` before it reaches `Wxh`, `Whh`, and the earlier time steps.\n\nThat does not mean “this is definitely the bug”. It means I would now start measuring this directly.\n\nFor example:\n\n\n    /*\n     * How much of the hidden state is near the tanh rails?\n     */\n    sat = 0;\n\n    for (i = 0; i < hidd; i++) {\n        if (fabsf(h[i]) > 0.95f) {\n            sat++;\n        }\n    }\n\n    sat_ratio = (float)sat / (float)hidd;\n\n\nAnd also:\n\n\n    /*\n     * Rough hidden-state magnitude.\n     *\n     * If this keeps rising toward 1.0,\n     * the recurrent state may be drifting into saturation.\n     */\n    h_abs_sum = 0.0f;\n\n    for (i = 0; i < hidd; i++) {\n        h_abs_sum += fabsf(h[i]);\n    }\n\n    mean_abs_h = h_abs_sum / (float)hidd;\n\n\nI would especially compare `dh` and `dh_raw`:\n\n\n    /*\n     * If dh_norm is alive but dh_raw_norm is tiny,\n     * the tanh nonlinearity is killing the local gradient.\n     */\n    dh_norm = 0.0f;\n    dh_raw_norm = 0.0f;\n\n    for (i = 0; i < hidd; i++) {\n        dh_norm     += dh[i] * dh[i];\n        dh_raw_norm += dh_raw[i] * dh_raw[i];\n    }\n\n    dh_norm     = sqrtf(dh_norm);\n    dh_raw_norm = sqrtf(dh_raw_norm);\n\n\nThat gives a more specific picture than only looking at the loss curve.\n\nThe 512 result also makes me think this may not be pure capacity. If 512 drops faster at first but later behaves about like 256, then more units helped early fitting speed, but may not have changed the main limiting mechanism.\n\nI would read that as:\n\n\n    /*\n     * 512 helped the model move faster early.\n     * It did not obviously remove the barrier.\n     *\n     * So the barrier may be:\n     *\n     *     saturation,\n     *     gradient scale,\n     *     update scale,\n     *     recurrence scale,\n     *     optimizer behaviour,\n     *     hidden-state handling,\n     *\n     * not only number of hidden units.\n     */\n\n\nThe Adam observation also points that way. If Adam takes longer but keeps making progress, I would read that as a clue that the step scale matters. Not necessarily “Adam is the answer”, but possibly:\n\n\n    /*\n     * If adaptive steps keep moving where manual LR modulation stalls,\n     * the problem may involve update scale, not just model size.\n     */\n\n\nI would also distinguish elementwise clipping from norm diagnostics.\n\nElementwise clipping is an emergency brake:\n\n\n    /*\n     * Elementwise hard clipping.\n     * Useful, but each component is capped independently.\n     */\n    if (g[i] >  5.0f) g[i] =  5.0f;\n    if (g[i] < -5.0f) g[i] = -5.0f;\n\n\nBut I would still want to know the whole gradient norm:\n\n\n    /*\n     * Diagnostic:\n     * is the whole gradient vector exploding?\n     */\n    grad_norm = sqrtf(sum_squared_gradients);\n\n\nAnd possibly try norm clipping:\n\n\n    /*\n     * Norm clipping:\n     * scale the whole gradient vector if it is too large.\n     */\n    if (grad_norm > clip_norm) {\n        scale = clip_norm / (grad_norm + 1e-12f);\n\n        /*\n         * multiply every gradient buffer by scale\n         */\n        scale_all_gradients(scale);\n    }\n\n\nAnother useful number is the update-to-weight ratio:\n\n\n    /*\n     * If this is large, the update may be too aggressive.\n     * If this is tiny, learning may be crawling.\n     */\n    update_norm = sqrtf(sum_squared_updates);\n    weight_norm = sqrtf(sum_squared_weights);\n\n    update_ratio = update_norm / (weight_norm + 1e-12f);\n\n\nSo my triage table would be:\n\nSymptom | First thing I would suspect | First thing I would measure\n---|---|---\nhidden state moves toward extremes | tanh saturation | `sat_ratio`, `mean_abs_h`\n512 drops faster, then stalls similarly | not pure capacity | compare saturation and gradient norms\nAdam is slower but keeps moving | update scale matters | update/weight ratio\nregularization increases error | maybe not overfitting yet | train/eval NLL gap\nLR modulation hits a wall | optimizer/scale issue | gradient norm, update ratio\nsecond layer seems tempting | reasonable later experiment | first-layer saturation first\n\nA second hidden layer might be a reasonable experiment, but I would probably not make it the next diagnostic step.\n\nIf the first recurrent layer is already pushing `tanh` states toward the rails, a second layer may add capacity, but it also adds another place for gradients to shrink, explode, or become hard to interpret.\n\nSo before adding that layer, I would try to print something like:\n\n\n    /*\n     * Current instrument panel.\n     */\n    mean_train_nll\n    mean_eval_nll\n\n    mean_abs_h\n    sat_ratio\n\n    dh_norm\n    dh_raw_norm\n\n    grad_norm\n    update_ratio\n\n\nThen the next decision becomes clearer:\n\n\n    /*\n     * If saturation is high:\n     *     reduce recurrent/input scale, adjust LR, check clipping,\n     *     maybe consider gated cells later.\n     *\n     * If gradients explode:\n     *     norm clipping / smaller LR / optimizer changes.\n     *\n     * If train improves but eval does not:\n     *     regularization / data split / capacity questions.\n     *\n     * If none of those are bad:\n     *     then 512, second layer, or a different cell become cleaner tests.\n     */\n\n\nSo I would call the previous issue mostly solved.\n\nThe new barrier looks less like:\n\n\n    /*\n     * Is the BPTT formula right?\n     */\n\n\nand more like:\n\n\n    /*\n     * Is this vanilla tanh RNN staying in a regime where it can keep learning?\n     */\n\n\nThat is a good next problem to have. It means the code is no longer just failing at the wiring level; it is starting to show the ordinary dynamics problems that vanilla recurrent nets tend to expose.",
  "title": "RNN in C - is this BPTT finally right?"
}