
RNN in C - is this BPTT finally right?

Hugging Face Forums [Unofficial] June 11, 2026

Hm. With the added context and code, I think the LLM's view would probably become something like this:


I think the training gradient may be closer to correct than the printed loss suggests.

This part looks like the right idea for the output gradient:

/*
 * netout is probability after softmax.
 * Then this turns it into:
 *
 *     dz = p - one_hot(target)
 */
netout[twin][tc] -= 1.f;

So after this line, netout[twin] is no longer a probability distribution. It is now the output gradient buffer.

That means this later line is suspicious:

ebuf -= log(fmax(1e-18f, netout[iter][cin[iter]]));

There are two separate issues there.

First, netout[iter] has already been changed from p into p - one_hot(target).

Second, cin[iter] looks like the input character index, not the target character index:

/*
 * b  = previous char / current input
 * tc = current read char / correct next-char target
 */
netin[b] = 1.f;
cin[twin] = b;

/*
 * But the training target for this step is tc.
 */

So I think the reporting path should probably be separated from the gradient path.

Something like:

/*
 * Store the actual next-character target.
 */
target[twin] = tc;

Then, immediately after softmax and before mutating netout:

/*
 * netout[twin][i] is still probability here.
 *
 * This is the next-character NLL for this step.
 */
step_nll[twin] = -logf(netout[twin][target[twin]] + 1e-12f);

total_nll += step_nll[twin];
nll_count++;

Only after that should netout be changed into the output gradient:

/*
 * Reuse netout as dz if you want.
 * But after this point, do not use it as probability.
 */
netout[twin][target[twin]] -= 1.0f;

Or, more explicitly, at the cost of more memory:

/*
 * p  = probability
 * dz = output gradient
 */
for (i = 0; i < 96; i++) {
    p[twin][i]  = netout[twin][i];
    dz[twin][i] = netout[twin][i];
}

dz[twin][target[twin]] -= 1.0f;

Then BPTT should consume dz, not p:

/*
 * Output layer gradient, for each output unit i.
 */
for (i = 0; i < 96; i++) {
    dobias[i] += dz[iter][i];

    for (j = 0; j < hidd; j++) {
        dnno[j][i] += dz[iter][i] * h[iplus][j];
    }
}

And the hidden gradient should also use dz:

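/*
 * dh = Why^T * dz + dh_next
 *
 * nno[i][j] plays the role of Why here:
 * i is the hidden unit, j is the output unit.
 */
for (i = 0; i < hidd; i++) {
    sum = dh_next[i];

    for (j = 0; j < 96; j++) {
        sum += nno[i][j] * dz[iter][j];
    }

    dh[i] = sum;
}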
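
As a self-contained version of that hidden-gradient step, something like the sketch below might help for testing it in isolation. The function names are hypothetical, and the tanh part is purely an assumption on my side: I cannot see from the thread which activation the hidden layer uses, so treat tanh_backward as illustrative only.

```c
#include <assert.h>
#include <stddef.h>

#define VOCAB 96  /* assumed output size, as in the loops in this thread */

/* dh[i] = dh_next[i] + sum_j nno[i][j] * dz[j]
 * nno is indexed [hidden][output], matching the snippet above. */
static void hidden_grad(float nno[][VOCAB], const float *dz,
                        const float *dh_next, float *dh, size_t hidd)
{
    size_t i, j;

    assert(hidd > 0);
    for (i = 0; i < hidd; i++) {
        float sum = dh_next[i];

        for (j = 0; j < VOCAB; j++)
            sum += nno[i][j] * dz[j];
        dh[i] = sum;
    }
}

/* ASSUMPTION: hidden units are h = tanh(pre).
 * Then the gradient w.r.t. the pre-activation is dh * (1 - h^2). */
static void tanh_backward(const float *h, const float *dh,
                          float *dpre, size_t hidd)
{
    size_t i;

    for (i = 0; i < hidd; i++)
        dpre[i] = dh[i] * (1.0f - h[i] * h[i]);
}
```

Feeding this a tiny hand-built case (two hidden units, one nonzero dz entry) and comparing against a pencil-and-paper result is a cheap way to rule out index mix-ups such as nno[i][j] versus nno[j][i].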

So if you want to keep reusing netout as dz, I would avoid computing loss from netout later.

Maybe the minimal change is:

/*
 * After softmax normalization:
 *
 *     netout[twin][i] == p[i]
 *
 * So compute loss now.
 */
target[twin] = tc;

ebuf += -logf(netout[twin][target[twin]] + 1e-12f);

/*
 * Now mutate netout into dz for BPTT.
 */
netout[twin][target[twin]] -= 1.0f;

And remove this from the BPTT loop:

/*
 * Suspicious:
 *
 * netout is no longer probability here,
 * and cin[iter] appears to be the input char, not the target char.
 */
ebuf -= log(fmax(1e-18f, netout[iter][cin[iter]]));

So my current guess is not “your BPTT is totally wrong”.

It is more like:

/*
 * The backward signal may be roughly right:
 *
 *     netout = p - target
 *
 * But the printed loss seems to be reading from the wrong object:
 *
 *     mutated gradient buffer, not probability
 *
 * and possibly from the wrong index:
 *
 *     input char, not next-char target
 */

That would also explain why the printed number behaves oddly while the model can still sometimes produce fragments that look meaningful.

The model may be learning something, but the instrument panel may still be wired to the wrong signal.
