External Publication
Visit Post

RNN in C - is this BPTT finally right?

Hugging Face Forums [Unofficial] June 2, 2026
Source

Hmm, maybe something like this…?


I think the BPTT part is close , but I would not call this “finally right” yet. The main issue I see is not really the recurrent part of BPTT; it is the output layer / softmax / cross-entropy path.

The short version:

Part My read
Hidden-state recurrence Looks structurally reasonable
dh = Why^T * dy + dh_next Correct idea
dh_raw = (1 - h^2) * dh Correct for tanh
dWxh += dh_raw * x[t] Correct idea
dWhh += dh_raw * h[t] Correct if your indexing is h[t+1] = f(x[t], h[t])
dh_next = Whh^T * dh_raw Correct idea
Softmax forward Looks wrong
Cross-entropy gradient Looks wrong because the softmax value is wrong
Final confidence Needs a small gradient check

The biggest red flag is that y appears to be used as multiple different things: logits, exponentiated logits, and something like a softmax output. I would split it into two arrays:

float z[T][Y_S];  // logits, raw output scores
float p[T][Y_S];  // softmax probabilities

That one naming change tends to eliminate a lot of bugs.

1. The core bug: logits, exp(logits), and probabilities are mixed

This part looks suspicious:

y[t][i] = exp(sum);
ssum += sum;

For softmax, if sum is the logit, the denominator should accumulate exp(sum), not sum.

The usual softmax is:

p_i = exp(z_i) / sum_j exp(z_j)

So if you store exp(sum) in y[t][i], then the denominator would need to sum those exponentials. But then later this happens:

softmax_out = expf(y[t][i]) / softsum[t];

That means you are exponentiating y[t][i] again. If y[t][i] already contains exp(logit), this becomes something like:

exp(exp(logit))

That is not softmax.

A cleaner implementation is:

z[t][i] = sum;        // raw logit
p[t][i] = softmax(z); // probability

This is also the convention used in Andrej Karpathy’s minimal character RNN. In min-char-rnn.py, the forward pass separates:

ys[t] = np.dot(Why, hs[t]) + by
ps[t] = np.exp(ys[t]) / np.sum(np.exp(ys[t]))
loss += -np.log(ps[t][targets[t], 0])

Here:

Karpathy variable Meaning
ys[t] logits / unnormalized scores
ps[t] softmax probabilities
targets[t] correct class index

That separation is very useful. I would copy that idea directly into the C version.

2. Use a numerically stable softmax

Even after fixing the logic, I would not write the naive softmax in C. Use the standard max-subtraction trick.

Both CS231n’s softmax notes and Jay Mody’s Numerically Stable Softmax and Cross Entropy explain the same point: exponentials can overflow, and subtracting the maximum logit gives an equivalent but much safer computation.

Use something like this:

// z[t][i] already contains the logits for time step t.
// p[t][i] will contain the probabilities.

float max_z = z[t][0];

for (int i = 1; i < Y_S; i++) {
    if (z[t][i] > max_z) {
        max_z = z[t][i];
    }
}

float denom = 0.0f;

for (int i = 0; i < Y_S; i++) {
    p[t][i] = expf(z[t][i] - max_z);
    denom += p[t][i];
}

for (int i = 0; i < Y_S; i++) {
    p[t][i] /= denom;
}

This has a few advantages:

Problem Why this helps
Overflow in expf The largest exponent becomes expf(0) = 1
Division by zero At least one denominator term is 1
Ambiguous variable meaning z is logits, p is probability
Easier backward pass dy = p - one_hot(target)

You can also compute the loss stably with log-sum-exp:

float log_denom = logf(denom) + max_z;
loss += -z[t][target[t]] + log_denom;

or, if you already computed p:

loss += -logf(p[t][target[t]] + 1e-12f);

The log-sum-exp version is cleaner numerically.

3. The output gradient should be p - one_hot(target)

For softmax + cross-entropy, the gradient with respect to the logits is the classic:

dy = p;
dy[target] -= 1.0f;

This is what Karpathy’s implementation does:

dy = np.copy(ps[t])
dy[targets[t]] -= 1

See also the CS231n neural-net case study section on the softmax gradient: CS231n: Training a Softmax Linear Classifier.

In C, I would write:

float dy[Y_S];

for (int i = 0; i < Y_S; i++) {
    dy[i] = p[t][i];
}

dy[target[t]] -= 1.0f;

Then the output-layer gradients are:

for (int i = 0; i < Y_S; i++) {
    dby[i] += dy[i];

    for (int j = 0; j < H_S; j++) {
        dWhy[i][j] += dy[i] * h[t + 1][j];
    }
}

This assumes your forward recurrence is:

h[t + 1] = tanh(Wxh * x[t] + Whh * h[t] + bh);
z[t]     = Why * h[t + 1] + by;
p[t]     = softmax(z[t]);

That indexing convention is fine. It just means h[t] is the previous state and h[t+1] is the current state.

4. Your BPTT recurrence looks mostly right

Given this forward pass:

h[t + 1] = tanh(Wxh * x[t] + Whh * h[t] + bh);
z[t]     = Why * h[t + 1] + by;
p[t]     = softmax(z[t]);

The backward pass should look like this:

for (int t = T - 1; t >= 0; t--) {
    // 1. Output gradient
    for (int i = 0; i < Y_S; i++) {
        dy[i] = p[t][i];
    }
    dy[target[t]] -= 1.0f;

    // 2. Gradients for Why and by
    for (int i = 0; i < Y_S; i++) {
        dby[i] += dy[i];

        for (int j = 0; j < H_S; j++) {
            dWhy[i][j] += dy[i] * h[t + 1][j];
        }
    }

    // 3. Backprop into current hidden state
    for (int i = 0; i < H_S; i++) {
        float sum = dh_next[i];

        for (int j = 0; j < Y_S; j++) {
            sum += Why[j][i] * dy[j];
        }

        dh[i] = sum;
    }

    // 4. Backprop through tanh
    for (int i = 0; i < H_S; i++) {
        dh_raw[i] = (1.0f - h[t + 1][i] * h[t + 1][i]) * dh[i];
        dbh[i] += dh_raw[i];
    }

    // 5. Gradients for Wxh and Whh
    for (int i = 0; i < H_S; i++) {
        for (int j = 0; j < X_S; j++) {
            dWxh[i][j] += dh_raw[i] * x[t][j];
        }

        for (int j = 0; j < H_S; j++) {
            dWhh[i][j] += dh_raw[i] * h[t][j];
        }
    }

    // 6. Propagate hidden gradient to previous time step
    for (int i = 0; i < H_S; i++) {
        float sum = 0.0f;

        for (int j = 0; j < H_S; j++) {
            sum += Whh[j][i] * dh_raw[j];
        }

        dh_next[i] = sum;
    }
}

This matches the same structure as Karpathy’s minimal RNN:

dWhy += np.dot(dy, hs[t].T)
dby += dy
dh = np.dot(Why.T, dy) + dhnext
dhraw = (1 - hs[t] * hs[t]) * dh
dbh += dhraw
dWxh += np.dot(dhraw, xs[t].T)
dWhh += np.dot(dhraw, hs[t-1].T)
dhnext = np.dot(Whh.T, dhraw)

The indexing difference is only this:

Concept Karpathy Your indexing
previous hidden hs[t-1] h[t]
current hidden hs[t] h[t+1]
output logits ys[t] z[t]
output probabilities ps[t] p[t]

So the recurrent part looks conceptually right.

5. Why the += accumulation is correct

The += on dWxh, dWhh, dWhy, dbh, and dby is not just an implementation detail. It is required.

In BPTT, the RNN is unrolled over time, but the parameters are shared at every time step. Therefore each occurrence contributes to the same parameter gradient. Dive into Deep Learning: Backpropagation Through Time explains this as standard backpropagation on the unrolled computational graph, with gradients summed across all repeated occurrences of the same parameter.

So these are correct in spirit:

dWhy[i][j] += dy[i] * h[t + 1][j];
dWxh[i][j] += dh_raw[i] * x[t][j];
dWhh[i][j] += dh_raw[i] * h[t][j];
dbh[i]     += dh_raw[i];
dby[i]     += dy[i];

The important point is that the values being accumulated must be correct. Right now the softmax path makes dy questionable, so it contaminates the rest of the gradients.

6. Reset gradient accumulators every update

Before each sequence/minibatch update, make sure all gradient arrays are zeroed:

memset(dWxh, 0, sizeof dWxh);
memset(dWhh, 0, sizeof dWhh);
memset(dWhy, 0, sizeof dWhy);
memset(dbh,  0, sizeof dbh);
memset(dby,  0, sizeof dby);

Also reset dh_next before the backward loop:

for (int i = 0; i < H_S; i++) {
    dh_next[i] = 0.0f;
}

Otherwise the gradient from a previous backward pass can leak into the current one.

7. Add gradient clipping

Vanilla RNNs are very prone to exploding gradients. Karpathy’s min-char-rnn.py clips every gradient array:

for dparam in [dWxh, dWhh, dWhy, dbh, dby]:
    np.clip(dparam, -5, 5, out=dparam)

D2L also points out that long sequences create both computational and numerical instability, including exploding/vanishing gradients: D2L BPTT.

In C:

static inline float clipf(float v, float lo, float hi) {
    if (v < lo) return lo;
    if (v > hi) return hi;
    return v;
}

Then apply it to all gradient arrays before the parameter update:

dWxh[i][j] = clipf(dWxh[i][j], -5.0f, 5.0f);
dWhh[i][j] = clipf(dWhh[i][j], -5.0f, 5.0f);
dWhy[i][j] = clipf(dWhy[i][j], -5.0f, 5.0f);
dbh[i]     = clipf(dbh[i],     -5.0f, 5.0f);
dby[i]     = clipf(dby[i],     -5.0f, 5.0f);

8. Use CrossEntropyLoss conventions as a sanity check

A useful mental model is PyTorch’s CrossEntropyLoss. PyTorch expects unnormalized logits as input, not already-softmaxed probabilities. It internally combines log-softmax and negative log-likelihood.

For a manual C implementation, that means:

Stage Store
Output affine layer logits z[t][i]
Softmax probabilities p[t][i]
Loss -log(p[t][target[t]]) or stable log-sum-exp
Backward into logits dy[i] = p[t][i] - one_hot(target[t])

Do not do:

prob = softmax(exp(logit));

Do:

prob = softmax(logit);

9. The fastest way to know: gradient check

After fixing the softmax path, I would not rely on visual inspection. Add a tiny finite-difference gradient check.

Use a very small model:

#define T   2
#define X_S 3
#define H_S 4
#define Y_S 3

Then check a few parameters from each group:

float eps = 1e-3f;

float old = Wxh[0][0];

Wxh[0][0] = old + eps;
float loss_plus = forward_loss_only();

Wxh[0][0] = old - eps;
float loss_minus = forward_loss_only();

Wxh[0][0] = old;

float numerical = (loss_plus - loss_minus) / (2.0f * eps);
float analytic = dWxh[0][0];

float relerr = fabsf(numerical - analytic) /
               fmaxf(1e-8f, fabsf(numerical) + fabsf(analytic));

PyTorch’s gradcheck is basically the same idea: compare analytical gradients against small finite-difference estimates. There is also a useful Karpathy-style gist with gradient checking added: taey16/min_char_rnn.

I would test:

Parameter What failure suggests
by output gradient / softmax / target indexing bug
Why output gradient or current hidden indexing bug
bh tanh backward bug
Wxh input-to-hidden gradient bug
Whh recurrent gradient or previous hidden indexing bug

If by and Why fail first, the bug is almost certainly in the output layer. If those pass but Wxh/Whh fail, then look at dh_next, dh_raw, or the h[t] vs h[t+1] indexing.

10. Minimal corrected structure

Putting the important parts together, the forward pass should be shaped like this:

// h[0] should contain the initial hidden state.

for (int t = 0; t < T; t++) {
    // Hidden state
    for (int i = 0; i < H_S; i++) {
        float sum = bh[i];

        for (int j = 0; j < X_S; j++) {
            sum += Wxh[i][j] * x[t][j];
        }

        for (int j = 0; j < H_S; j++) {
            sum += Whh[i][j] * h[t][j];
        }

        h[t + 1][i] = tanhf(sum);
    }

    // Logits
    for (int i = 0; i < Y_S; i++) {
        float sum = by[i];

        for (int j = 0; j < H_S; j++) {
            sum += Why[i][j] * h[t + 1][j];
        }

        z[t][i] = sum;
    }

    // Stable softmax
    float max_z = z[t][0];

    for (int i = 1; i < Y_S; i++) {
        if (z[t][i] > max_z) {
            max_z = z[t][i];
        }
    }

    float denom = 0.0f;

    for (int i = 0; i < Y_S; i++) {
        p[t][i] = expf(z[t][i] - max_z);
        denom += p[t][i];
    }

    for (int i = 0; i < Y_S; i++) {
        p[t][i] /= denom;
    }

    // Loss
    loss += -logf(p[t][target[t]] + 1e-12f);
}

And the backward pass should begin from:

for (int i = 0; i < Y_S; i++) {
    dy[i] = p[t][i];
}

dy[target[t]] -= 1.0f;

Then continue with the BPTT recurrence you already mostly have.

Final verdict

I would phrase it this way:

The recurrent/BPTT skeleton is mostly right, but the output-layer math is not. Fix the softmax path first. Store raw logits separately from probabilities. Use stable softmax. Use dy = p - one_hot(target) for softmax + cross-entropy. Then run a small gradient check before trusting the implementation.

So, not “finally right” yet, but probably not far off.

Useful references:

  • Andrej Karpathy — min-char-rnn.py
  • CS231n — Softmax classifier
  • CS231n — softmax gradient in the neural-net case study
  • Jay Mody — Numerically Stable Softmax and Cross Entropy
  • Dive into Deep Learning — Backpropagation Through Time
  • PyTorch — CrossEntropyLoss
  • PyTorch — gradcheck
  • taey16 — min_char_rnn with gradient checking

Discussion in the ATmosphere

Loading comments...