RNN in C - is this BPTT finally right?
hi all, did i finally get the BPTT right? i am trying to ascertain method for elementary BPTT and produce a reference friendly to my style of comprehension and impartment. in the last three years i’ve discovered that if one doesn’t have it quite right, one may still get results that make you think it’s sort of working. i’m spending a lot of time trying to figure out what i’ve understood and who is telling something they know anything about and generally wasting my life away again trying to absorb another ancient method.
this psuedo code doesn’t handle trivial things like populating the input and stuff, but if the BPTT looks in order to you, PLEASE say something to infer it, thanks!
#define T 25 #define X_S 32 #define H_S 64 #define Y_S 32
float Wxh[H_S][X_S]; float Whh[H_S][H_S]; float Why[Y_S][H_S]; // weights + biases float bh[H_S]; float by[Y_S];
float x[T][X_S]; // Input history - BPTT buffers float h[T+1][H_S]; // Hidden state history (h[-1] is h[0] initialized to 0) float y[T][Y_S];
float softsum[T];
float dWxh[H_S][X_S]; float dWhh[H_S][H_S]; float dWhy[Y_S][H_S]; // Gradients (Accumulators) float dbh[H_S]; float dby[Y_S];
void rnn_forward_and_backward(int targets[T]) { // 1. FORWARD PASS
for (int t = 0; t < T; t++) {
for (int i = 0; i < H_S; i++) {
float sum = bh[i];
for (int j = 0; j < X_S; j++) sum += Wxh[i][j] * x[t][j];
for (int j = 0; j < H_S; j++) sum += Whh[i][j] * h[t][j];
h[t+1][i] = tanhf(sum);
}
float ssum = 0; for (int i = 0; i < Y_S; i++) { float sum = by[i]; for (int j = 0; j < H_S; j++) sum += Why[i][j] * h[t+1][j]; y[t][i] = exp(sum); ssum += sum; // y[t][i] = sum; } softsum[t] = 0; if (ssum > 0.f) { softsum[t] = ssum = 1.f / ssum; for (int i = 0; i < Y_S; i++) y[t][i] *= ssum; } }
float dh_next[H_S] = {0.0f};
// zero out gradient accumulators before BPTT dWxh, dWhh, dWhy, dbh, dby to 0
for (int t = T - 1; t >= 0; t--) { // BPTT back propogation through time
float dy[Y_S];
for (int i = 0; i < Y_S; i++) {
float softmax_out = expf(y[t][i]) / softsum[t];
float target_out = (i == targets[t]) ? 1.0f : 0.0f;
dy[i] = softmax_out - target_out;
}
for (int i = 0; i < Y_S; i++) {
dby[i] += dy[i];
for (int j = 0; j < H_S; j++) dWhy[i][j] += dy[i] * h[t+1][j];
}
float dh[H_S];
for (int i = 0; i < H_S; i++) {
float sum = dh_next[i];
for (int j = 0; j < Y_S; j++) sum += Why[j][i] * dy[j];
dh[i] = sum;
}
float dh_raw[H_S];
for (int i = 0; i < H_S; i++) dh_raw[i] = (1.0f - (h[t+1][i] * h[t+1][i])) * dh[i];
for (int i = 0; i < H_S; i++) {
dbh[i] += dh_raw[i];
for (int j = 0; j < X_S; j++) dWxh[i][j] += dh_raw[i] * x[t][j]; // optimise j with storing valid i for x[t]
for (int j = 0; j < H_S; j++) dWhh[i][j] += dh_raw[i] * h[t][j];
}
for (int i = 0; i < H_S; i++) {
float sum = 0.0f;
for (int j = 0; j < H_S; j++) sum += Whh[j][i] * dh_raw[j];
dh_next[i] = sum;
}
}
}
to me this style of notation tells someone exactly how an operation is performed (as long as they understand += style operators). but for the last thirty years, people basically say they want variable names like little novels and to object orient all the code in seventeen documents so it takes you three days to discern if anyone has a clue what they’re talking about.
thank you!
Discussion in the ATmosphere