Pooling in CNNs: Shrink the Map, Keep What Matters
After a few conv layers you're drowning in feature maps: too big and too slow. Pooling shrinks them while keeping the signal, and it has zero parameters. Four numbers in, one out.
Max vs average pooling: https://dev48v.infy.uk/dl/day9-pooling.html
The operation
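// img is an h×w 2D array; out is a preallocated (h/2)×(w/2) 2D array.
// The `+ 1` bounds skip a trailing odd row or column instead of reading past the edge.
for (let y = 0; y + 1 < h; y += 2) {     // stride 2: hop, don't slide
  for (let x = 0; x + 1 < w; x += 2) {
    const win = [img[y][x], img[y][x+1], img[y+1][x], img[y+1][x+1]];
    out[y/2][x/2] = reduce(win);         // 2×2 → 1 value, halves both dims
  }
}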
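The loop above assumes `img`, `out`, `h`, `w` and `reduce` already exist. Here is a minimal self-contained sketch that wraps it in a hypothetical `pool2x2` helper (the name and the sample data are not from the original). It allocates its own output, assumes a non-empty rectangular input, and drops a trailing odd row or column:

```javascript
// Hypothetical helper: 2×2 window, stride 2, any reducer over the four values.
function pool2x2(img, reduce) {
  const h = img.length;
  const w = img[0].length;
  const out = [];
  for (let y = 0; y + 1 < h; y += 2) {
    const row = [];
    for (let x = 0; x + 1 < w; x += 2) {
      row.push(reduce([img[y][x], img[y][x + 1], img[y + 1][x], img[y + 1][x + 1]]));
    }
    out.push(row);
  }
  return out;
}

// A made-up 4×6 map of increasing values, so each window's max is its bottom-right cell.
const img = Array.from({ length: 4 }, (_, y) =>
  Array.from({ length: 6 }, (_, x) => y * 6 + x)
);
const pooled = pool2x2(img, (win) => Math.max(...win));

console.log(pooled.length, pooled[0].length); // 2 3 (both dims halved)
console.log(pooled);                          // values: [[7, 9, 11], [19, 21, 23]]
```

Passing the reducer as an argument keeps the windowing logic separate from the choice of max or average.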
Max pool: keep the strongest
const reduce = (w) => Math.max(...w);
A feature detector asks "is this feature here?" Max pooling answers "yes, somewhere in this region it fired", keeping the feature's presence while discarding its exact location.
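To see "presence kept, location discarded" on a small scale, the sketch below feeds the reducer two windows in which the same response sits in different cells. The activation values are made up for illustration:

```javascript
const maxOf = (win) => Math.max(...win);

// Windows are flattened as [top-left, top-right, bottom-left, bottom-right].
// The same detector response (0.9) appears in a different cell of each window.
const firedTopLeft = [0.9, 0.0, 0.1, 0.0];
const firedBottomRight = [0.0, 0.1, 0.0, 0.9];

console.log(maxOf(firedTopLeft));     // 0.9
console.log(maxOf(firedBottomRight)); // 0.9, same output: where it fired is gone
```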
Average pool: smooth it
const reduce = (w) => w.reduce((a,b)=>a+b) / w.length;
Average pooling is common at the end of a network, where global average pooling summarises each whole map as a single number before the classifier.
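Global average pooling is the same averaging reducer with the window stretched over the whole map. A rough sketch, assuming the feature maps arrive as an array of 2D arrays (the `globalAveragePool` name and the sample values are hypothetical):

```javascript
// Hypothetical helper: collapses each h×w feature map to its mean value.
function globalAveragePool(maps) {
  return maps.map((map) => {
    const values = map.flat();
    return values.reduce((a, b) => a + b, 0) / values.length;
  });
}

// Two made-up 2×2 feature maps (one per channel).
const maps = [
  [[1, 2], [3, 4]],
  [[0, 0], [0, 8]],
];

console.log(globalAveragePool(maps)); // values: [2.5, 2], one number per map
```

The result has one entry per channel regardless of the map's height and width.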
The real point: translation tolerance
Because pooling reports "feature present in region", shifting the input by a pixel barely changes the output. That tolerance, built up across stacked conv and pooling layers, is part of why a CNN can recognise a cat whether it's top-left or centre.
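A small experiment makes the "barely changes" claim concrete. This is a sketch with a made-up 4×4 map and hypothetical helpers (`maxPool2x2`, `shiftRight`). A one-pixel shift that stays inside a window leaves the pooled output untouched, while a shift that crosses a window boundary moves the response by one pooled cell:

```javascript
// Hypothetical helper: 2×2 max pooling with stride 2 on a rectangular 2D array.
function maxPool2x2(img) {
  const out = [];
  for (let y = 0; y + 1 < img.length; y += 2) {
    const row = [];
    for (let x = 0; x + 1 < img[0].length; x += 2) {
      row.push(Math.max(img[y][x], img[y][x + 1], img[y + 1][x], img[y + 1][x + 1]));
    }
    out.push(row);
  }
  return out;
}

// Hypothetical helper: shifts every row right by one pixel, padding with 0.
const shiftRight = (img) => img.map((row) => [0, ...row.slice(0, -1)]);

const original = [
  [9, 0, 0, 0],
  [0, 0, 0, 0],
  [0, 0, 0, 0],
  [0, 0, 0, 0],
];
const shiftedOnce = shiftRight(original);     // the 9 moves to column 1, same window
const shiftedTwice = shiftRight(shiftedOnce); // the 9 moves to column 2, next window

console.log(maxPool2x2(original));     // values: [[9, 0], [0, 0]]
console.log(maxPool2x2(shiftedOnce));  // values: [[9, 0], [0, 0]], unchanged
console.log(maxPool2x2(shiftedTwice)); // values: [[0, 9], [0, 0]], moved by one pooled cell
```

So a single pooling layer is tolerant of small shifts rather than fully invariant to them.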
And it's free: no learnable weights. (Modern nets sometimes downsample with strided convolutions instead. Those do carry learnable weights, but the intent is the same.)
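To show how close the two ideas are, here is a sketch of a 2×2 convolution with stride 2, simplified to one input channel, one output channel and no bias (the `stridedConv2x2` helper is hypothetical). With every weight fixed at 0.25 it reproduces average pooling. In a real network those four weights would be learned:

```javascript
// Hypothetical helper: 2×2 kernel, stride 2, single channel, no bias.
function stridedConv2x2(img, kernel) {
  const out = [];
  for (let y = 0; y + 1 < img.length; y += 2) {
    const row = [];
    for (let x = 0; x + 1 < img[0].length; x += 2) {
      row.push(
        kernel[0][0] * img[y][x] + kernel[0][1] * img[y][x + 1] +
        kernel[1][0] * img[y + 1][x] + kernel[1][1] * img[y + 1][x + 1]
      );
    }
    out.push(row);
  }
  return out;
}

// Made-up 4×4 feature map.
const featureMap = [
  [1, 3, 2, 0],
  [4, 2, 1, 1],
  [0, 0, 5, 6],
  [1, 2, 7, 8],
];

// Fixed weights of 0.25 turn the convolution into 2×2 average pooling.
const averagingKernel = [[0.25, 0.25], [0.25, 0.25]];

console.log(stridedConv2x2(featureMap, averagingKernel)); // values: [[2.5, 1], [0.75, 6.5]]
```

Max pooling has no equivalent fixed kernel, because taking a maximum is not a weighted sum.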
The takeaway
Hop in 2×2s, keep the max → smaller maps, position-tolerant, zero params. Pool a grid.