
Residual neural networks (ResNets) were introduced in the 2015 paper Deep Residual Learning for Image Recognition. A ResNet's defining feature is the "skip connection," which catalyzed the creation of neural networks with hundreds of layers, far deeper than what had previously been trainable.


One of the building blocks of residual learning used in ResNets.

The notebook below is my implementation of the ResNet architecture from scratch. There are a few subtleties in the implementation that I needed to check against the PyTorch implementation to get right. The more interesting part is the theoretical interpretation of what is going on. Below, I explain the intuition behind ResNets and the skip connections they contain.
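To make the structure concrete, here is a minimal sketch of a residual block in PyTorch. It is a simplified version of the basic block used in the original architecture, not a drop-in replacement for the torchvision implementation; the class name and argument choices here are my own.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """A minimal residual block: output = relu(conv branch + skip branch)."""
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3,
                               stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3,
                               padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        # When the spatial size or channel count changes, a 1x1 conv
        # projects the input so it can be added to the conv branch.
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1,
                          stride=stride, bias=False),
                nn.BatchNorm2d(out_channels),
            )
        else:
            self.shortcut = nn.Identity()

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        # The skip connection: add the (possibly projected) input back in.
        return torch.relu(out + self.shortcut(x))
```

The subtlety I had to look up is exactly the shortcut branch: when a block changes the channel count or downsamples, the skip path needs its own 1x1 convolution so the shapes match before the addition.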


Residual layers allowed the field to train networks with hundreds more layers than previously used.


A neural network architecture is capable of learning some class of functions, call it F. If there is a "truth" function f* that we would like to find, we might get lucky while training on our dataset and arrive at it, but we may also end up with a learned function nowhere close to what's desired. We usually assume that larger function classes are more expressive and have greater capacity to capture the truth function f*. However, for the non-nested function classes illustrated above, a larger function class does not necessarily move us closer to f*. With nested function classes we avoid this issue: each larger class contains the smaller ones, so an increase in a network's depth strictly increases its expressive power.

ResNets can be interpreted as a way to force your network to learn nested function classes. This works because each residual block computes f(x) = x + g(x), where g is the learned residual branch: if g learns to output zero, the block reduces to the identity, so a deeper network can always represent everything a shallower one can.

If newly added layers can be trained to the identity function f(x) = x, the deeper model is at least as effective as the original. ResNets make this easy: because each layer only needs to learn a residual on top of the identity, every additional layer can easily contain the identity function as one of its elements, so layers beyond a network's optimal depth do no harm. This insight comes from the book Dive Into Deep Learning.
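This identity argument can be checked directly in a few lines. Below is a toy residual layer (the name ResidualLinear and the linear residual branch are my own simplification, chosen so the identity is exact without a final nonlinearity): zeroing the residual branch's weights makes the whole layer pass its input through unchanged.

```python
import torch
import torch.nn as nn

class ResidualLinear(nn.Module):
    """Toy residual layer: y = x + g(x), with g a single linear map."""
    def __init__(self, dim):
        super().__init__()
        self.g = nn.Linear(dim, dim)

    def forward(self, x):
        return x + self.g(x)

layer = ResidualLinear(4)
# Zero the residual branch: now g(x) = 0 everywhere,
# so the layer is exactly the identity function.
nn.init.zeros_(layer.g.weight)
nn.init.zeros_(layer.g.bias)

x = torch.randn(2, 4)
print(torch.equal(layer(x), x))  # the layer returns x unchanged
```

A plain (non-residual) layer would instead have to learn weights that reproduce the identity map, which gradient descent has no particular reason to find; the residual parameterization makes the identity the easy default.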
