
RNN in C - is this BPTT finally right?

Hugging Face Forums [Unofficial] June 13, 2026

Nice! For now, I have just tried to triage the current symptoms.

That sounds like a real step forward.

I would now separate the previous wall from the new wall:

| Stage | My read |
| --- | --- |
| BPTT form | probably mostly settled |
| output gradient | probably mostly settled |
| loss/reporting | much clearer now |
| current barrier | probably training dynamics / saturation / gradient scale |
| architecture choice | still relevant, but I would not make it the first suspect yet |

So my current guess would change from:

/*
 * previous suspect:
 *
 * are we measuring the right thing?
 */

to:

/*
 * current suspect:
 *
 * is the vanilla tanh recurrence staying in a trainable regime?
 */

The part that catches my eye is this:

“the hidden layer dissolves from centric behaviour towards either extremity”

For a vanilla tanh RNN, that sounds important. If hidden units drift near +1 or -1, then the local derivative gets small:

/*
 * tanh derivative.
 *
 * If h is near +1 or -1,
 * this factor goes near zero.
 */
dh_raw[i] = (1.0f - h[i] * h[i]) * dh[i];

So even if dh exists, the nonlinearity can shrink dh_raw before it reaches Wxh, Whh, and the earlier time steps.
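To put rough numbers on that shrinkage, here is a minimal sketch that evaluates the factor `1 - h*h` for a few activation values. The helper names `tanh_grad_factor` and `print_tanh_grad_table` and the sample values are mine, chosen only for illustration; they are not from the original code.

```c
#include <assert.h>
#include <stdio.h>

/* Local tanh derivative, written in terms of the activation h = tanh(a). */
static float tanh_grad_factor(float h)
{
    assert(h >= -1.0f && h <= 1.0f);
    return 1.0f - h * h;
}

/* Print the factor for a few illustrative (made-up) activation values. */
static void print_tanh_grad_table(void)
{
    static const float samples[] = { 0.0f, 0.5f, 0.9f, 0.99f };
    size_t i;

    for (i = 0; i < sizeof samples / sizeof samples[0]; i++) {
        printf("h = %4.2f  ->  1 - h*h = %.4f\n",
               (double)samples[i],
               (double)tanh_grad_factor(samples[i]));
    }
}
```

At h = 0.99 the factor is roughly 0.02, so only about 2% of `dh[i]` survives into `dh_raw[i]` for that unit, and the same kind of factor is applied again at every earlier time step.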

That does not mean “this is definitely the bug”. It means I would now start measuring this directly.

For example:

/*
 * How much of the hidden state is near the tanh rails?
 */
sat = 0;

for (i = 0; i < hidd; i++) {
    if (fabsf(h[i]) > 0.95f) {
        sat++;
    }
}

sat_ratio = (float)sat / (float)hidd;

And also:

/*
 * Rough hidden-state magnitude.
 *
 * If this keeps rising toward 1.0,
 * the recurrent state may be drifting into saturation.
 */
h_abs_sum = 0.0f;

for (i = 0; i < hidd; i++) {
    h_abs_sum += fabsf(h[i]);
}

mean_abs_h = h_abs_sum / (float)hidd;
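The two loops above can also be folded into one pass. This is only a sketch: the `h_stats` struct, the `hidden_stats` function, and the `rail` parameter are hypothetical names, and it assumes `h` is a plain float array of length `hidd` for a single time step.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Hypothetical container for the two saturation statistics. */
struct h_stats {
    float sat_ratio;   /* fraction of units with |h| > rail */
    float mean_abs_h;  /* mean of |h| over the hidden units  */
};

/* One pass over the hidden state; rail is the saturation threshold
 * (0.95f in the snippet above). */
static struct h_stats hidden_stats(const float *h, int hidd, float rail)
{
    struct h_stats s;
    float abs_sum = 0.0f;
    int sat = 0;
    int i;

    assert(h != NULL && hidd > 0);

    for (i = 0; i < hidd; i++) {
        float a = fabsf(h[i]);

        abs_sum += a;
        if (a > rail) {
            sat++;
        }
    }

    s.sat_ratio  = (float)sat / (float)hidd;
    s.mean_abs_h = abs_sum / (float)hidd;
    return s;
}
```

Calling it as `hidden_stats(h, hidd, 0.95f)` at a fixed time step every few hundred iterations should be enough to see whether the state is drifting.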

I would especially compare dh and dh_raw:

/*
 * If dh_norm is alive but dh_raw_norm is tiny,
 * the tanh nonlinearity is killing the local gradient.
 */
dh_norm = 0.0f;
dh_raw_norm = 0.0f;

for (i = 0; i < hidd; i++) {
    dh_norm     += dh[i] * dh[i];
    dh_raw_norm += dh_raw[i] * dh_raw[i];
}

dh_norm     = sqrtf(dh_norm);
dh_raw_norm = sqrtf(dh_raw_norm);

That gives a more specific picture than only looking at the loss curve.

The 512-unit result also makes me think this may not be a pure capacity problem. If the 512-unit run drops faster at first but later behaves about like the 256-unit run, then the extra units helped early fitting speed but may not have changed the main limiting mechanism.

I would read that as:

/*
 * 512 helped the model move faster early.
 * It did not obviously remove the barrier.
 *
 * So the barrier may be:
 *
 *     saturation,
 *     gradient scale,
 *     update scale,
 *     recurrence scale,
 *     optimizer behaviour,
 *     hidden-state handling,
 *
 * not only number of hidden units.
 */

The Adam observation also points that way. If Adam takes longer but keeps making progress, I would read that as a clue that the step scale matters. Not necessarily “Adam is the answer”, but possibly:

/*
 * If adaptive steps keep moving where manual LR modulation stalls,
 * the problem may involve update scale, not just model size.
 */

I would also distinguish elementwise clipping from norm diagnostics.

Elementwise clipping is an emergency brake:

/*
 * Elementwise hard clipping.
 * Useful, but each component is capped independently.
 */
if (g[i] >  5.0f) g[i] =  5.0f;
if (g[i] < -5.0f) g[i] = -5.0f;

But I would still want to know the whole gradient norm:

/*
 * Diagnostic:
 * is the whole gradient vector exploding?
 */
grad_norm = sqrtf(sum_squared_gradients);

And possibly try norm clipping:

/*
 * Norm clipping:
 * scale the whole gradient vector if it is too large.
 */
if (grad_norm > clip_norm) {
    scale = clip_norm / (grad_norm + 1e-12f);

    /*
     * multiply every gradient buffer by scale
     */
    scale_all_gradients(scale);
}

Another useful number is the update-to-weight ratio:

/*
 * If this is large, the update may be too aggressive.
 * If this is tiny, learning may be crawling.
 */
update_norm = sqrtf(sum_squared_updates);
weight_norm = sqrtf(sum_squared_weights);

update_ratio = update_norm / (weight_norm + 1e-12f);

So my triage table would be:

| Symptom | First thing I would suspect | First thing I would measure |
| --- | --- | --- |
| hidden state moves toward extremes | tanh saturation | `sat_ratio`, `mean_abs_h` |
| 512 drops faster, then stalls similarly | not pure capacity | compare saturation and gradient norms |
| Adam is slower but keeps moving | update scale matters | update/weight ratio |
| regularization increases error | maybe not overfitting yet | train/eval NLL gap |
| LR modulation hits a wall | optimizer/scale issue | gradient norm, update ratio |
| second layer seems tempting | reasonable later experiment | first-layer saturation first |

A second hidden layer might be a reasonable experiment, but I would probably not make it the next diagnostic step.

If the first recurrent layer is already pushing tanh states toward the rails, a second layer may add capacity, but it also adds another place for gradients to shrink, explode, or become hard to interpret.

So before adding that layer, I would try to print something like:

/*
 * Current instrument panel.
 */
mean_train_nll
mean_eval_nll

mean_abs_h
sat_ratio

dh_norm
dh_raw_norm

grad_norm
update_ratio
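The panel above could be gathered into one struct and written as a single log line. A sketch only: `rnn_diag`, `format_diag`, and the `step` argument are hypothetical, and the field names simply mirror the list above.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical bundle of the diagnostics listed above. */
struct rnn_diag {
    float mean_train_nll;
    float mean_eval_nll;
    float mean_abs_h;
    float sat_ratio;
    float dh_norm;
    float dh_raw_norm;
    float grad_norm;
    float update_ratio;
};

/* Format one log line into buf.  Returns what snprintf returns:
 * the length the full line needs, not counting the terminator. */
static int format_diag(char *buf, size_t size, long step,
                       const struct rnn_diag *d)
{
    assert(buf != NULL && d != NULL && size > 0);

    return snprintf(buf, size,
                    "step=%ld train_nll=%.4f eval_nll=%.4f "
                    "mean_abs_h=%.3f sat_ratio=%.3f "
                    "dh_norm=%.3e dh_raw_norm=%.3e "
                    "grad_norm=%.3e update_ratio=%.3e",
                    step,
                    (double)d->mean_train_nll, (double)d->mean_eval_nll,
                    (double)d->mean_abs_h, (double)d->sat_ratio,
                    (double)d->dh_norm, (double)d->dh_raw_norm,
                    (double)d->grad_norm, (double)d->update_ratio);
}
```

Keeping everything on one line per logging step makes it easy to grep or plot the columns later and to line them up against the loss curve.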

Then the next decision becomes clearer:

/*
 * If saturation is high:
 *     reduce recurrent/input scale, adjust LR, check clipping,
 *     maybe consider gated cells later.
 *
 * If gradients explode:
 *     norm clipping / smaller LR / optimizer changes.
 *
 * If train improves but eval does not:
 *     regularization / data split / capacity questions.
 *
 * If none of those are bad:
 *     then 512, second layer, or a different cell become cleaner tests.
 */
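The same decision order can be written as a small function, which at least forces the checks to happen in a fixed sequence. Everything here is hypothetical: the enum, the `triage_limits` struct, and the function name are mine, the thresholds are left to the caller because I do not know good values for this model, and the train/eval gap is only a crude stand-in for "train improves but eval does not".

```c
#include <assert.h>
#include <stddef.h>

enum next_step {
    NEXT_FIX_SATURATION,        /* recurrent/input scale, LR, clipping  */
    NEXT_FIX_EXPLOSION,         /* norm clipping, smaller LR, optimizer */
    NEXT_CHECK_GENERALIZATION,  /* regularization, data split, capacity */
    NEXT_TRY_ARCHITECTURE       /* 512, second layer, different cell    */
};

/* Caller-chosen thresholds; no particular values are implied. */
struct triage_limits {
    float max_sat_ratio;
    float max_grad_norm;
    float max_nll_gap;
};

/* Check the suspects in the same order as the comment above. */
static enum next_step triage(float sat_ratio, float grad_norm,
                             float train_nll, float eval_nll,
                             const struct triage_limits *lim)
{
    assert(lim != NULL);

    if (sat_ratio > lim->max_sat_ratio) {
        return NEXT_FIX_SATURATION;
    }
    if (grad_norm > lim->max_grad_norm) {
        return NEXT_FIX_EXPLOSION;
    }
    if (eval_nll - train_nll > lim->max_nll_gap) {
        return NEXT_CHECK_GENERALIZATION;
    }
    return NEXT_TRY_ARCHITECTURE;
}
```

I would not treat the returned value as a verdict, only as a reminder of which measurement to look at first.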

So I would call the previous issue mostly solved.

The new barrier looks less like:

/*
 * Is the BPTT formula right?
 */

and more like:

/*
 * Is this vanilla tanh RNN staying in a regime where it can keep learning?
 */

That is a good next problem to have. It means the code is no longer just failing at the wiring level; it is starting to show the ordinary dynamics problems that vanilla recurrent nets tend to expose.
