RNN in C - is this BPTT finally right?
Nice! That sounds like a real step forward.
For now, I have just tried to triage the current symptoms.
I would now separate the previous wall from the new wall:
| Stage | My read |
|---|---|
| BPTT form | probably mostly settled |
| output gradient | probably mostly settled |
| loss/reporting | much clearer now |
| current barrier | probably training dynamics / saturation / gradient scale |
| architecture choice | still relevant, but I would not make it the first suspect yet |
So my current guess would change from:
```c
/*
 * previous suspect:
 *
 *     are we measuring the right thing?
 */
```
to:
```c
/*
 * current suspect:
 *
 *     is the vanilla tanh recurrence staying in a trainable regime?
 */
```
The part that catches my eye is this:
> the hidden layer dissolves from centric behaviour towards either extremity
For a vanilla tanh RNN, that sounds important. If hidden units drift near +1 or -1, then the local derivative gets small:
```c
/*
 * tanh derivative.
 *
 * If h is near +1 or -1,
 * this factor goes near zero.
 */
dh_raw[i] = (1.0f - h[i] * h[i]) * dh[i];
```
So even if dh exists, the nonlinearity can shrink dh_raw before it reaches Wxh, Whh, and the earlier time steps.
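To put rough numbers on that shrinkage, here is a minimal sketch. The helper names `tanh_grad_factor` and `print_tanh_factors` are mine, not from the original code; they only evaluate the same `1 - h*h` factor for a few activation values.

```c
#include <stdio.h>

/* Hypothetical helper: local tanh derivative written in terms of the
 * activation h = tanh(a), i.e. d tanh(a) / da = 1 - h*h. */
static float tanh_grad_factor(float h)
{
    return 1.0f - h * h;
}

/* Print the factor for a few activation values to show how quickly
 * it collapses as h approaches the rails. */
static void print_tanh_factors(void)
{
    static const float hs[] = { 0.0f, 0.5f, 0.9f, 0.95f, 0.99f };
    size_t i;

    for (i = 0; i < sizeof hs / sizeof hs[0]; i++) {
        printf("h = %.2f  ->  1 - h*h = %.4f\n",
               (double)hs[i], (double)tanh_grad_factor(hs[i]));
    }
}
```

At `h = 0.95` the factor is already below 0.1, and at `h = 0.99` it is about 0.02. In BPTT one such factor is applied per time step, so a unit that stays near the rails for several steps can shrink its gradient contribution a lot, depending on what `Whh` does in between.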
That does not mean “this is definitely the bug”. It means I would now start measuring this directly.
For example:
```c
/*
 * How much of the hidden state is near the tanh rails?
 */
sat = 0;
for (i = 0; i < hidd; i++) {
    if (fabsf(h[i]) > 0.95f) {
        sat++;
    }
}
sat_ratio = (float)sat / (float)hidd;
```
And also:
```c
/*
 * Rough hidden-state magnitude.
 *
 * If this keeps rising toward 1.0,
 * the recurrent state may be drifting into saturation.
 */
h_abs_sum = 0.0f;
for (i = 0; i < hidd; i++) {
    h_abs_sum += fabsf(h[i]);
}
mean_abs_h = h_abs_sum / (float)hidd;
```
I would especially compare dh and dh_raw:
```c
/*
 * If dh_norm is alive but dh_raw_norm is tiny,
 * the tanh nonlinearity is killing the local gradient.
 */
dh_norm = 0.0f;
dh_raw_norm = 0.0f;
for (i = 0; i < hidd; i++) {
    dh_norm += dh[i] * dh[i];
    dh_raw_norm += dh_raw[i] * dh_raw[i];
}
dh_norm = sqrtf(dh_norm);
dh_raw_norm = sqrtf(dh_raw_norm);
```
That gives a more specific picture than only looking at the loss curve.
The 512 result also makes me think this may not be pure capacity. If 512 drops faster at first but later behaves about like 256, then more units helped early fitting speed, but may not have changed the main limiting mechanism.
I would read that as:
```c
/*
 * 512 helped the model move faster early.
 * It did not obviously remove the barrier.
 *
 * So the barrier may be:
 *
 *     saturation,
 *     gradient scale,
 *     update scale,
 *     recurrence scale,
 *     optimizer behaviour,
 *     hidden-state handling,
 *
 * not only number of hidden units.
 */
```
The Adam observation also points that way. If Adam takes longer but keeps making progress, I would read that as a clue that the step scale matters. Not necessarily “Adam is the answer”, but possibly:
```c
/*
 * If adaptive steps keep moving where manual LR modulation stalls,
 * the problem may involve update scale, not just model size.
 */
```
I would also distinguish elementwise clipping from norm diagnostics.
Elementwise clipping is an emergency brake:
```c
/*
 * Elementwise hard clipping.
 * Useful, but each component is capped independently.
 */
if (g[i] > 5.0f) g[i] = 5.0f;
if (g[i] < -5.0f) g[i] = -5.0f;
```
But I would still want to know the whole gradient norm:
```c
/*
 * Diagnostic:
 * is the whole gradient vector exploding?
 */
grad_norm = sqrtf(sum_squared_gradients);
```
And possibly try norm clipping:
```c
/*
 * Norm clipping:
 * scale the whole gradient vector if it is too large.
 */
if (grad_norm > clip_norm) {
    scale = clip_norm / (grad_norm + 1e-12f);
    /*
     * multiply every gradient buffer by scale
     */
    scale_all_gradients(scale);
}
```
Another useful number is the update-to-weight ratio:
```c
/*
 * If this is large, the update may be too aggressive.
 * If this is tiny, learning may be crawling.
 */
update_norm = sqrtf(sum_squared_updates);
weight_norm = sqrtf(sum_squared_weights);
update_ratio = update_norm / (weight_norm + 1e-12f);
```
So my triage table would be:
| Symptom | First thing I would suspect | First thing I would measure |
|---|---|---|
| hidden state moves toward extremes | tanh saturation | sat_ratio, mean_abs_h |
| 512 drops faster, then stalls similarly | not pure capacity | compare saturation and gradient norms |
| Adam is slower but keeps moving | update scale matters | update/weight ratio |
| regularization increases error | maybe not overfitting yet | train/eval NLL gap |
| LR modulation hits a wall | optimizer/scale issue | gradient norm, update ratio |
| second layer seems tempting | reasonable later experiment | first-layer saturation first |
A second hidden layer might be a reasonable experiment, but I would probably not make it the next diagnostic step.
If the first recurrent layer is already pushing tanh states toward the rails, a second layer may add capacity, but it also adds another place for gradients to shrink, explode, or become hard to interpret.
So before adding that layer, I would try to print something like:
```c
/*
 * Current instrument panel.
 */
mean_train_nll
mean_eval_nll
mean_abs_h
sat_ratio
dh_norm
dh_raw_norm
grad_norm
update_ratio
```
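As a sketch of what that printout could look like in code: the `Panel` struct and `format_panel` function below are hypothetical, and the field names simply mirror the list above.

```c
#include <stdio.h>

/* Hypothetical bundle of the numbers listed above. */
typedef struct {
    float mean_train_nll;
    float mean_eval_nll;
    float mean_abs_h;
    float sat_ratio;
    float dh_norm;
    float dh_raw_norm;
    float grad_norm;
    float update_ratio;
} Panel;

/* Format one log line into buf. Returns what snprintf returns:
 * the length the full line needs, or a negative value on error. */
static int format_panel(char *buf, size_t cap, long step, const Panel *p)
{
    return snprintf(buf, cap,
        "step %ld train_nll %.4f eval_nll %.4f mean_abs_h %.3f "
        "sat_ratio %.3f dh %.3e dh_raw %.3e grad %.3e upd_ratio %.3e",
        step,
        (double)p->mean_train_nll, (double)p->mean_eval_nll,
        (double)p->mean_abs_h, (double)p->sat_ratio,
        (double)p->dh_norm, (double)p->dh_raw_norm,
        (double)p->grad_norm, (double)p->update_ratio);
}
```

Keeping everything on one line per logging step makes it easy to dump the output to a file and plot the columns against each other later.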
Then the next decision becomes clearer:
```c
/*
 * If saturation is high:
 *     reduce recurrent/input scale, adjust LR, check clipping,
 *     maybe consider gated cells later.
 *
 * If gradients explode:
 *     norm clipping / smaller LR / optimizer changes.
 *
 * If train improves but eval does not:
 *     regularization / data split / capacity questions.
 *
 * If none of those are bad:
 *     then 512, second layer, or a different cell become cleaner tests.
 */
```
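The same decision order can be written as a small function, which at least forces the checks to happen in a fixed sequence. Everything here is hypothetical: the enum, the `TriageLimits` struct and `triage` are my names, the limits are passed in because I do not know what "high" means for this particular network, and "train improves but eval does not" is approximated by the train/eval NLL gap.

```c
/* Hypothetical labels for the four branches above. */
typedef enum {
    NEXT_FIX_SATURATION,        /* reduce scale, adjust LR, check clipping */
    NEXT_FIX_EXPLOSION,         /* norm clipping, smaller LR, optimizer */
    NEXT_CHECK_GENERALIZATION,  /* regularization, data split, capacity */
    NEXT_TRY_ARCHITECTURE       /* 512, second layer, different cell */
} NextStep;

/* Caller-supplied limits; no defaults are suggested here. */
typedef struct {
    float sat_ratio_high;
    float grad_norm_high;
    float nll_gap_high;
} TriageLimits;

/* Check the symptoms in the same order as the comment above. */
static NextStep triage(float sat_ratio, float grad_norm,
                       float train_nll, float eval_nll,
                       const TriageLimits *lim)
{
    if (sat_ratio > lim->sat_ratio_high) {
        return NEXT_FIX_SATURATION;
    }
    if (grad_norm > lim->grad_norm_high) {
        return NEXT_FIX_EXPLOSION;
    }
    if (eval_nll - train_nll > lim->nll_gap_high) {
        return NEXT_CHECK_GENERALIZATION;
    }
    return NEXT_TRY_ARCHITECTURE;
}
```

The ordering is the main design choice: saturation is checked first because, if it is present, the gradient and generalization numbers are harder to interpret.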
So I would call the previous issue mostly solved.
The new barrier looks less like:
```c
/*
 * Is the BPTT formula right?
 */
```
and more like:
```c
/*
 * Is this vanilla tanh RNN staying in a regime where it can keep learning?
 */
```
That is a good next problem to have. It means the code is no longer just failing at the wiring level; it is starting to show the ordinary dynamics problems that vanilla recurrent nets tend to expose.