RNN in C - is this BPTT finally right?
Hmm, maybe something like this…?
I think the BPTT part is close , but I would not call this “finally right” yet. The main issue I see is not really the recurrent part of BPTT; it is the output layer / softmax / cross-entropy path.
The short version:
| Part | My read |
|---|---|
| Hidden-state recurrence | Looks structurally reasonable |
dh = Why^T * dy + dh_next |
Correct idea |
dh_raw = (1 - h^2) * dh |
Correct for tanh |
dWxh += dh_raw * x[t] |
Correct idea |
dWhh += dh_raw * h[t] |
Correct if your indexing is h[t+1] = f(x[t], h[t]) |
dh_next = Whh^T * dh_raw |
Correct idea |
| Softmax forward | Looks wrong |
| Cross-entropy gradient | Looks wrong because the softmax value is wrong |
| Final confidence | Needs a small gradient check |
The biggest red flag is that y appears to be used as multiple different things: logits, exponentiated logits, and something like a softmax output. I would split it into two arrays:
float z[T][Y_S]; // logits, raw output scores
float p[T][Y_S]; // softmax probabilities
That one naming change tends to eliminate a lot of bugs.
1. The core bug: logits, exp(logits), and probabilities are mixed
This part looks suspicious:
y[t][i] = exp(sum);
ssum += sum;
For softmax, if sum is the logit, the denominator should accumulate exp(sum), not sum.
The usual softmax is:
p_i = exp(z_i) / sum_j exp(z_j)
So if you store exp(sum) in y[t][i], then the denominator would need to sum those exponentials. But then later this happens:
softmax_out = expf(y[t][i]) / softsum[t];
That means you are exponentiating y[t][i] again. If y[t][i] already contains exp(logit), this becomes something like:
exp(exp(logit))
That is not softmax.
A cleaner implementation is:
z[t][i] = sum; // raw logit
p[t][i] = softmax(z); // probability
This is also the convention used in Andrej Karpathy’s minimal character RNN. In min-char-rnn.py, the forward pass separates:
ys[t] = np.dot(Why, hs[t]) + by
ps[t] = np.exp(ys[t]) / np.sum(np.exp(ys[t]))
loss += -np.log(ps[t][targets[t], 0])
Here:
| Karpathy variable | Meaning |
|---|---|
ys[t] |
logits / unnormalized scores |
ps[t] |
softmax probabilities |
targets[t] |
correct class index |
That separation is very useful. I would copy that idea directly into the C version.
2. Use a numerically stable softmax
Even after fixing the logic, I would not write the naive softmax in C. Use the standard max-subtraction trick.
Both CS231n’s softmax notes and Jay Mody’s Numerically Stable Softmax and Cross Entropy explain the same point: exponentials can overflow, and subtracting the maximum logit gives an equivalent but much safer computation.
Use something like this:
// z[t][i] already contains the logits for time step t.
// p[t][i] will contain the probabilities.
float max_z = z[t][0];
for (int i = 1; i < Y_S; i++) {
if (z[t][i] > max_z) {
max_z = z[t][i];
}
}
float denom = 0.0f;
for (int i = 0; i < Y_S; i++) {
p[t][i] = expf(z[t][i] - max_z);
denom += p[t][i];
}
for (int i = 0; i < Y_S; i++) {
p[t][i] /= denom;
}
This has a few advantages:
| Problem | Why this helps |
|---|---|
Overflow in expf |
The largest exponent becomes expf(0) = 1 |
| Division by zero | At least one denominator term is 1 |
| Ambiguous variable meaning | z is logits, p is probability |
| Easier backward pass | dy = p - one_hot(target) |
You can also compute the loss stably with log-sum-exp:
float log_denom = logf(denom) + max_z;
loss += -z[t][target[t]] + log_denom;
or, if you already computed p:
loss += -logf(p[t][target[t]] + 1e-12f);
The log-sum-exp version is cleaner numerically.
3. The output gradient should be p - one_hot(target)
For softmax + cross-entropy, the gradient with respect to the logits is the classic:
dy = p;
dy[target] -= 1.0f;
This is what Karpathy’s implementation does:
dy = np.copy(ps[t])
dy[targets[t]] -= 1
See also the CS231n neural-net case study section on the softmax gradient: CS231n: Training a Softmax Linear Classifier.
In C, I would write:
float dy[Y_S];
for (int i = 0; i < Y_S; i++) {
dy[i] = p[t][i];
}
dy[target[t]] -= 1.0f;
Then the output-layer gradients are:
for (int i = 0; i < Y_S; i++) {
dby[i] += dy[i];
for (int j = 0; j < H_S; j++) {
dWhy[i][j] += dy[i] * h[t + 1][j];
}
}
This assumes your forward recurrence is:
h[t + 1] = tanh(Wxh * x[t] + Whh * h[t] + bh);
z[t] = Why * h[t + 1] + by;
p[t] = softmax(z[t]);
That indexing convention is fine. It just means h[t] is the previous state and h[t+1] is the current state.
4. Your BPTT recurrence looks mostly right
Given this forward pass:
h[t + 1] = tanh(Wxh * x[t] + Whh * h[t] + bh);
z[t] = Why * h[t + 1] + by;
p[t] = softmax(z[t]);
The backward pass should look like this:
for (int t = T - 1; t >= 0; t--) {
// 1. Output gradient
for (int i = 0; i < Y_S; i++) {
dy[i] = p[t][i];
}
dy[target[t]] -= 1.0f;
// 2. Gradients for Why and by
for (int i = 0; i < Y_S; i++) {
dby[i] += dy[i];
for (int j = 0; j < H_S; j++) {
dWhy[i][j] += dy[i] * h[t + 1][j];
}
}
// 3. Backprop into current hidden state
for (int i = 0; i < H_S; i++) {
float sum = dh_next[i];
for (int j = 0; j < Y_S; j++) {
sum += Why[j][i] * dy[j];
}
dh[i] = sum;
}
// 4. Backprop through tanh
for (int i = 0; i < H_S; i++) {
dh_raw[i] = (1.0f - h[t + 1][i] * h[t + 1][i]) * dh[i];
dbh[i] += dh_raw[i];
}
// 5. Gradients for Wxh and Whh
for (int i = 0; i < H_S; i++) {
for (int j = 0; j < X_S; j++) {
dWxh[i][j] += dh_raw[i] * x[t][j];
}
for (int j = 0; j < H_S; j++) {
dWhh[i][j] += dh_raw[i] * h[t][j];
}
}
// 6. Propagate hidden gradient to previous time step
for (int i = 0; i < H_S; i++) {
float sum = 0.0f;
for (int j = 0; j < H_S; j++) {
sum += Whh[j][i] * dh_raw[j];
}
dh_next[i] = sum;
}
}
This matches the same structure as Karpathy’s minimal RNN:
dWhy += np.dot(dy, hs[t].T)
dby += dy
dh = np.dot(Why.T, dy) + dhnext
dhraw = (1 - hs[t] * hs[t]) * dh
dbh += dhraw
dWxh += np.dot(dhraw, xs[t].T)
dWhh += np.dot(dhraw, hs[t-1].T)
dhnext = np.dot(Whh.T, dhraw)
The indexing difference is only this:
| Concept | Karpathy | Your indexing |
|---|---|---|
| previous hidden | hs[t-1] |
h[t] |
| current hidden | hs[t] |
h[t+1] |
| output logits | ys[t] |
z[t] |
| output probabilities | ps[t] |
p[t] |
So the recurrent part looks conceptually right.
5. Why the += accumulation is correct
The += on dWxh, dWhh, dWhy, dbh, and dby is not just an implementation detail. It is required.
In BPTT, the RNN is unrolled over time, but the parameters are shared at every time step. Therefore each occurrence contributes to the same parameter gradient. Dive into Deep Learning: Backpropagation Through Time explains this as standard backpropagation on the unrolled computational graph, with gradients summed across all repeated occurrences of the same parameter.
So these are correct in spirit:
dWhy[i][j] += dy[i] * h[t + 1][j];
dWxh[i][j] += dh_raw[i] * x[t][j];
dWhh[i][j] += dh_raw[i] * h[t][j];
dbh[i] += dh_raw[i];
dby[i] += dy[i];
The important point is that the values being accumulated must be correct. Right now the softmax path makes dy questionable, so it contaminates the rest of the gradients.
6. Reset gradient accumulators every update
Before each sequence/minibatch update, make sure all gradient arrays are zeroed:
memset(dWxh, 0, sizeof dWxh);
memset(dWhh, 0, sizeof dWhh);
memset(dWhy, 0, sizeof dWhy);
memset(dbh, 0, sizeof dbh);
memset(dby, 0, sizeof dby);
Also reset dh_next before the backward loop:
for (int i = 0; i < H_S; i++) {
dh_next[i] = 0.0f;
}
Otherwise the gradient from a previous backward pass can leak into the current one.
7. Add gradient clipping
Vanilla RNNs are very prone to exploding gradients. Karpathy’s min-char-rnn.py clips every gradient array:
for dparam in [dWxh, dWhh, dWhy, dbh, dby]:
np.clip(dparam, -5, 5, out=dparam)
D2L also points out that long sequences create both computational and numerical instability, including exploding/vanishing gradients: D2L BPTT.
In C:
static inline float clipf(float v, float lo, float hi) {
if (v < lo) return lo;
if (v > hi) return hi;
return v;
}
Then apply it to all gradient arrays before the parameter update:
dWxh[i][j] = clipf(dWxh[i][j], -5.0f, 5.0f);
dWhh[i][j] = clipf(dWhh[i][j], -5.0f, 5.0f);
dWhy[i][j] = clipf(dWhy[i][j], -5.0f, 5.0f);
dbh[i] = clipf(dbh[i], -5.0f, 5.0f);
dby[i] = clipf(dby[i], -5.0f, 5.0f);
8. Use CrossEntropyLoss conventions as a sanity check
A useful mental model is PyTorch’s CrossEntropyLoss. PyTorch expects unnormalized logits as input, not already-softmaxed probabilities. It internally combines log-softmax and negative log-likelihood.
For a manual C implementation, that means:
| Stage | Store |
|---|---|
| Output affine layer | logits z[t][i] |
| Softmax | probabilities p[t][i] |
| Loss | -log(p[t][target[t]]) or stable log-sum-exp |
| Backward into logits | dy[i] = p[t][i] - one_hot(target[t]) |
Do not do:
prob = softmax(exp(logit));
Do:
prob = softmax(logit);
9. The fastest way to know: gradient check
After fixing the softmax path, I would not rely on visual inspection. Add a tiny finite-difference gradient check.
Use a very small model:
#define T 2
#define X_S 3
#define H_S 4
#define Y_S 3
Then check a few parameters from each group:
float eps = 1e-3f;
float old = Wxh[0][0];
Wxh[0][0] = old + eps;
float loss_plus = forward_loss_only();
Wxh[0][0] = old - eps;
float loss_minus = forward_loss_only();
Wxh[0][0] = old;
float numerical = (loss_plus - loss_minus) / (2.0f * eps);
float analytic = dWxh[0][0];
float relerr = fabsf(numerical - analytic) /
fmaxf(1e-8f, fabsf(numerical) + fabsf(analytic));
PyTorch’s gradcheck is basically the same idea: compare analytical gradients against small finite-difference estimates. There is also a useful Karpathy-style gist with gradient checking added: taey16/min_char_rnn.
I would test:
| Parameter | What failure suggests |
|---|---|
by |
output gradient / softmax / target indexing bug |
Why |
output gradient or current hidden indexing bug |
bh |
tanh backward bug |
Wxh |
input-to-hidden gradient bug |
Whh |
recurrent gradient or previous hidden indexing bug |
If by and Why fail first, the bug is almost certainly in the output layer. If those pass but Wxh/Whh fail, then look at dh_next, dh_raw, or the h[t] vs h[t+1] indexing.
10. Minimal corrected structure
Putting the important parts together, the forward pass should be shaped like this:
// h[0] should contain the initial hidden state.
for (int t = 0; t < T; t++) {
// Hidden state
for (int i = 0; i < H_S; i++) {
float sum = bh[i];
for (int j = 0; j < X_S; j++) {
sum += Wxh[i][j] * x[t][j];
}
for (int j = 0; j < H_S; j++) {
sum += Whh[i][j] * h[t][j];
}
h[t + 1][i] = tanhf(sum);
}
// Logits
for (int i = 0; i < Y_S; i++) {
float sum = by[i];
for (int j = 0; j < H_S; j++) {
sum += Why[i][j] * h[t + 1][j];
}
z[t][i] = sum;
}
// Stable softmax
float max_z = z[t][0];
for (int i = 1; i < Y_S; i++) {
if (z[t][i] > max_z) {
max_z = z[t][i];
}
}
float denom = 0.0f;
for (int i = 0; i < Y_S; i++) {
p[t][i] = expf(z[t][i] - max_z);
denom += p[t][i];
}
for (int i = 0; i < Y_S; i++) {
p[t][i] /= denom;
}
// Loss
loss += -logf(p[t][target[t]] + 1e-12f);
}
And the backward pass should begin from:
for (int i = 0; i < Y_S; i++) {
dy[i] = p[t][i];
}
dy[target[t]] -= 1.0f;
Then continue with the BPTT recurrence you already mostly have.
Final verdict
I would phrase it this way:
The recurrent/BPTT skeleton is mostly right, but the output-layer math is not. Fix the softmax path first. Store raw logits separately from probabilities. Use stable softmax. Use
dy = p - one_hot(target)for softmax + cross-entropy. Then run a small gradient check before trusting the implementation.
So, not “finally right” yet, but probably not far off.
Useful references:
- Andrej Karpathy — min-char-rnn.py
- CS231n — Softmax classifier
- CS231n — softmax gradient in the neural-net case study
- Jay Mody — Numerically Stable Softmax and Cross Entropy
- Dive into Deep Learning — Backpropagation Through Time
- PyTorch — CrossEntropyLoss
- PyTorch — gradcheck
- taey16 — min_char_rnn with gradient checking
Discussion in the ATmosphere