Convolutional Neural Networks (CNN)

Author

David I. Inouye

Why Convolutional Networks?: Neuroscientific Inspiration

Gabor Functions Derived From Neuroscience Experiments Are Simple Convolutional Filters [DL, ch. 9]


Why Convolutional Networks?: Neuroscientific Inspiration

Convolutional Networks Automatically Learn Filters Similar to Gabor Functions [DL, ch. 9]


Why Convolutional Networks?: Computational Reasons

  • Sparse computation (compared to fully connected layers)
    • Computationally efficient (can be implemented with fast libraries)
    • \(O(n \times k)\) instead of \(O(n^2)\) for a fully connected layer
  • Shared parameters (only a small number of shared parameters)
    • Comparison of number of parameters for fully connected vs convolutional layer:
      • Fully connected: \(O(n_{in} \times n_{out})\)
      • Convolutional: \(O(k \times k \times c_{in} \times c_{out})\) where \(k\) is kernel size and \(c\) is number of channels
    • Fewer parameters \(\to\) less data needed to train
  • Translation equivariance
    • Convolutional layers can detect features regardless of their position (if the input shifts, the feature map shifts correspondingly)
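A quick comparison of the parameter counts above (the layer sizes here are arbitrary, chosen only for illustration):

```python
import torch.nn as nn

# Hypothetical sizes for illustration: an 8x8 input with 4 input channels
# and 8 output channels, comparing a 3x3 convolution to a fully connected
# layer over the same input/output sizes.
conv = nn.Conv2d(in_channels=4, out_channels=8, kernel_size=3, padding=1)
fc = nn.Linear(4 * 8 * 8, 8 * 8 * 8)

n_conv = sum(p.numel() for p in conv.parameters())
n_fc = sum(p.numel() for p in fc.parameters())
print('Conv2d params:', n_conv)  # 3*3*4*8 weights + 8 biases = 296
print('Linear params:', n_fc)    # 256*512 weights + 512 biases = 131584
```

Note that the convolutional parameter count is independent of the spatial input size, while the fully connected count grows with it.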

1D Convolutions Are Similar to but Slightly Different From Signal Processing / Math Convolutions


Padding or Stride Parameters Alter the Computation and Output Shape


1D Convolutions With Padding


1D Demo: 1D convolutions are similar to but slightly different from signal processing / math convolutions (deep learning libraries do not flip the filter, i.e., they compute cross-correlation)

A [-1, 1] filter/kernel highlights “sharp points” (jumps) of the signal

import torch
import matplotlib.pyplot as plt
%matplotlib inline

t = torch.linspace(0, 1.0, 300)
x = (torch.cos(10*t) > 0.0).float() + 0.1*torch.sin(100*t)-0.5
plt.plot(t.numpy(), x.numpy(), label='Original Signal')

from torch.nn import functional as F
filt = torch.tensor([-1, 1.0])
print('Filter')
print(filt)
# Input should have shape (m, c, w) where m is the minibatch size, c is the number of channels, and w is the width
y = F.conv1d(
    x.reshape(1, 1, len(x)), 
    filt.reshape(1, 1, len(filt))
).squeeze_()
plt.plot(
    t.numpy()[:len(y)], y.numpy(), 
    label='After Convolution')
plt.legend()
Filter
tensor([-1.,  1.])

Convolutions are linear operators (i.e., matrix multiplication) with shared parameters

x = torch.randn(10).float().requires_grad_(True)
filt = torch.tensor([-1, 1]).float()
#filt = torch.tensor([1, 2, 3, 4]).float()
y = F.conv1d(x.reshape(1, 1, len(x)), filt.reshape(1, 1, len(filt))).squeeze_()

def extract_jacobian(x, y):
    J = torch.zeros((len(y), len(x))).float()
    for i in range(len(y)):
        v = torch.zeros(len(y)).float()
        v[i] = 1
        if x.grad is not None:
            x.grad.zero_()
        y.backward(v, retain_graph=True)
        J[i, :] = x.grad
    return J

A = extract_jacobian(x, y)
print(A)
y2 = torch.matmul(A, x)
print(y)
print(y2)
print(y-y2)
tensor([[-1.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
        [ 0., -1.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
        [ 0.,  0., -1.,  1.,  0.,  0.,  0.,  0.,  0.,  0.],
        [ 0.,  0.,  0., -1.,  1.,  0.,  0.,  0.,  0.,  0.],
        [ 0.,  0.,  0.,  0., -1.,  1.,  0.,  0.,  0.,  0.],
        [ 0.,  0.,  0.,  0.,  0., -1.,  1.,  0.,  0.,  0.],
        [ 0.,  0.,  0.,  0.,  0.,  0., -1.,  1.,  0.,  0.],
        [ 0.,  0.,  0.,  0.,  0.,  0.,  0., -1.,  1.,  0.],
        [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0., -1.,  1.]])
tensor([ 0.0651,  0.0130,  1.1876,  1.8053, -1.4806, -0.4799,  1.3002, -1.4347,
         0.4389], grad_fn=<SqueezeBackward3>)
tensor([ 0.0651,  0.0130,  1.1876,  1.8053, -1.4806, -0.4799,  1.3002, -1.4347,
         0.4389], grad_fn=<MvBackward0>)
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0.], grad_fn=<SubBackward0>)

2D Convolutions Are Simple Generalizations to Matrices


2D convolutions are similar and can be applied to images

Different filters extract different features from the image

import sklearn.datasets
A = torch.tensor(sklearn.datasets.load_sample_image('flower.jpg')).float()
#A = torch.tensor(sklearn.datasets.load_sample_image('china.jpg')).float() # Alternative sample image
A = torch.sum(A, dim=2) # Sum channels

for filt in [
    torch.tensor([[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]).float(), # Horizontal
    torch.tensor([[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]).float().t(), # Vertical
    torch.tensor([[1, -1], [-1, 1]]).float(), # Checker board pattern
    torch.ones((10, 10)).float(), # Blur
]:
    print('Filter size', filt.size(), 'A size', A.size())
    print(filt)
    B = F.conv2d(A.reshape(1, 1, *A.size()), filt.reshape(1, 1, *filt.size()), padding=1).squeeze()
    #B = F.conv2d(A.reshape(1, 1, *A.size()), filt.reshape(1, 1, *filt.size())).squeeze()

    fig, axes = plt.subplots(1, 2, figsize=(14,4))
    axes[0].imshow(A.numpy(), cmap='gray')
    axes[1].imshow(B.numpy(), cmap='gray')
Filter size torch.Size([3, 3]) A size torch.Size([427, 640])
tensor([[-1.,  0.,  1.],
        [-1.,  0.,  1.],
        [-1.,  0.,  1.]])
Filter size torch.Size([3, 3]) A size torch.Size([427, 640])
tensor([[-1., -1., -1.],
        [ 0.,  0.,  0.],
        [ 1.,  1.,  1.]])
Filter size torch.Size([2, 2]) A size torch.Size([427, 640])
tensor([[ 1., -1.],
        [-1.,  1.]])
Filter size torch.Size([10, 10]) A size torch.Size([427, 640])
tensor([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]])

2D Convolutions With Channels Are Like Simple 2D Convolutions but All Arrays Have a Channel Dimension

A “\(f_h \times f_w\) convolution” (the channel dimension is assumed)


2D convolutions with channel dimension are similar (i.e., if there is more than 1 channel)

A = torch.tensor(sklearn.datasets.load_sample_image('flower.jpg')).float()
A = A/255
A = A.permute(2,0,1)

for filt in [
    torch.tensor([1, 0, 0]).reshape(3, 1, 1).float(), 
    torch.tensor([0, 1, 0]).reshape(3, 1, 1).float(), 
    torch.tensor([0, 0, 1]).reshape(3, 1, 1).float(),
]:
    print('Filter size', filt.size(), 'A size', A.size())
    print(filt)
    B = F.conv2d(
        A.reshape(1, *A.size()), 
        filt.reshape(1, *filt.size())
    ).squeeze()
    print('B size', B.size())

    fig, axes = plt.subplots(1, 2, figsize=(14,4))
    axes[0].imshow(A.permute(1,2,0), cmap='gray')
    axes[1].imshow(B, cmap='gray')
Filter size torch.Size([3, 1, 1]) A size torch.Size([3, 427, 640])
tensor([[[1.]],

        [[0.]],

        [[0.]]])
B size torch.Size([427, 640])
Filter size torch.Size([3, 1, 1]) A size torch.Size([3, 427, 640])
tensor([[[0.]],

        [[1.]],

        [[0.]]])
B size torch.Size([427, 640])
Filter size torch.Size([3, 1, 1]) A size torch.Size([3, 427, 640])
tensor([[[0.]],

        [[0.]],

        [[1.]]])
B size torch.Size([427, 640])


Multiple Convolutions Increase the Output Channel Dimension


Reasoning About Input and Output Shapes Is Important for Debugging and Designing CNNs

  • Convolution input parameters
    • \(ChannelIn = C_{in}\)
    • \(ChannelOut = C_{out}\) (equivalent to # filters)
    • \(KernelSize = [K_0, K_1]\)
    • \(Stride = [S_0, S_1]\)
    • \(Padding = [P_0, P_1]\)
  • \(C_{out}\) = # filters
  • Output spatial dimensions
    • \[ H_{out} = \lfloor \tfrac{H_{in} + 2 P_0 - K_0}{S_0} + 1 \rfloor \]
    • \[ W_{out} = \lfloor \tfrac{W_{in} + 2 P_1 - K_1}{S_1} + 1 \rfloor \]
  • Output batch dimension should match input
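These shape formulas can be verified directly with `F.conv2d` (the sizes below are arbitrary, chosen only for illustration):

```python
import torch
from torch.nn import functional as F

# Check the H_out/W_out formulas on an arbitrary example
H_in, W_in = 27, 35
K, S, P = (5, 3), (2, 1), (2, 0)
x = torch.randn(1, 3, H_in, W_in)   # N x C_in x H_in x W_in
w = torch.randn(7, 3, K[0], K[1])   # C_out x C_in x K_0 x K_1
y = F.conv2d(x, w, stride=S, padding=P)

H_out = (H_in + 2 * P[0] - K[0]) // S[0] + 1
W_out = (W_in + 2 * P[1] - K[1]) // S[1] + 1
print(y.shape)  # torch.Size([1, 7, 14, 33])
assert y.shape == (1, 7, H_out, W_out)
```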

Common Convolution Configurations

\[ H_{out} = \lfloor \frac{H_{in} + 2 P_0 - K_0}{S_0} + 1 \rfloor \]

  • Output has same height and width as input
    • \(1 \times 1\) convolution with padding=0, stride=1
    • \(3 \times 3\) convolution with padding=1, stride=1
    • \(5 \times 5\) convolution with padding=2, stride=1
  • Output has half the height and width of input
    • \(2 \times 2\) convolution with padding=0, stride=2
    • \(4 \times 4\) convolution with padding=1, stride=2
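A quick sanity check of these configurations (input size is arbitrary):

```python
import torch
from torch.nn import functional as F

x = torch.randn(1, 1, 32, 32)
# "Same" configurations: output height/width equal input
for k, p in [(1, 0), (3, 1), (5, 2)]:
    w = torch.randn(1, 1, k, k)
    y = F.conv2d(x, w, stride=1, padding=p)
    print(k, p, y.shape)  # always torch.Size([1, 1, 32, 32])
# "Halving" configurations
for k, p in [(2, 0), (4, 1)]:
    w = torch.randn(1, 1, k, k)
    y = F.conv2d(x, w, stride=2, padding=p)
    print(k, p, y.shape)  # always torch.Size([1, 1, 16, 16])
```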

Need several other components for extracting features

  • Activation functions
  • Pooling layers

Why activation functions? Activation functions enable non-linear models

Consider a deep linear network

torch.manual_seed(0)
A1 = torch.randn((10, 5))
A2 = torch.randn((10, 10))
A3 = torch.randn((1, 10))

x = torch.randn(5)
print('x', x)
y = torch.matmul(A1, x)
y = torch.matmul(A2, y)
y = torch.matmul(A3, y)
print('y', y)

b = torch.matmul(A3, torch.matmul(A2, A1))
y2 = torch.matmul(b, x)
print('y2', y2)
x tensor([ 1.4875, -0.2230, -1.0057, -0.4139,  1.1600])
y tensor([4.1752])
y2 tensor([4.1752])

Why activation functions? Activation functions enable non-linear models

If you add activation functions, the deep function cannot be simplified

torch.manual_seed(0)
A1 = torch.randn((10, 5))
A2 = torch.randn((10, 10))
A3 = torch.randn((1, 10))

x = torch.randn(5)
print('x', x)
y = torch.matmul(A1, x)
y = torch.relu(y)
y = torch.matmul(A2, y)
y = torch.relu(y)
y = torch.matmul(A3, y)
print('y', y)

b = torch.matmul(A3, torch.matmul(A2, A1))
y2 = torch.matmul(b, x)
print('y2', y2)
x tensor([ 1.4875, -0.2230, -1.0057, -0.4139,  1.1600])
y tensor([18.9449])
y2 tensor([4.1752])

Without a ReLU or other activation function, the function can only be linear

N, D_in, H, D_out = 64, 1, 10, 1
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.Linear(H, D_out),
)
x = torch.linspace(-1, 1, 100).reshape(-1, 1)
y = model(x)
plt.plot(x.detach().numpy(), y.detach().numpy())

With ReLU activation function, the function is piecewise linear

N, D_in, H, D_out = 64, 1, 10, 1
for random_seed in [0, 1, 2, 3, 4]:
    torch.manual_seed(random_seed)
    model = torch.nn.Sequential(
        torch.nn.Linear(D_in, H),
        torch.nn.ReLU(),
        torch.nn.Linear(H, D_out),
    )
    x = torch.linspace(-1, 1, 100).reshape(-1, 1)
    y = model(x)
    plt.plot(x.detach().numpy(), y.detach().numpy())

Common activation functions include sigmoid, ReLU, Leaky ReLU, tanh

t = torch.linspace(-3, 3, 300)
fig = plt.figure(figsize=(5,3), dpi=200)
plt.plot(t.numpy(), torch.sigmoid(t).numpy(), label='sigmoid')
plt.plot(t.numpy(), F.relu(t).numpy(), label='ReLU')
plt.plot(t.numpy(), F.leaky_relu(t, negative_slope=0.25).numpy(), label='Leaky ReLU')
plt.plot(t.numpy(), torch.tanh(t).numpy(), label='tanh')
plt.legend()

Pooling layers are used to reduce dimensionality and introduce some location invariance

Max pooling layers

torch.manual_seed(0)
x = torch.randint(10, (10,)).float()
y = F.max_pool1d(x.reshape(1,1,-1), kernel_size=3)
y2 = F.max_pool1d(x.reshape(1,1,-1), kernel_size=3, stride=1)
y3 = F.max_pool1d(x.reshape(1,1,-1), kernel_size=3, stride=1, padding=1)
print(x)
print(y)
print(y2)
print(y3)
tensor([4., 9., 3., 0., 3., 9., 7., 3., 7., 3.])
tensor([[[9., 9., 7.]]])
tensor([[[9., 9., 3., 9., 9., 9., 7., 7.]]])
tensor([[[9., 9., 9., 3., 9., 9., 9., 7., 7., 7.]]])

Pooling layers are used to reduce dimensionality and introduce some location invariance

Average pooling layers

torch.manual_seed(0)
x = torch.randint(10, (10,)).float()
y = F.avg_pool1d(x.reshape(1,1,-1), kernel_size=3)
y2 = F.avg_pool1d(x.reshape(1,1,-1), kernel_size=3, stride=1)
y3 = F.avg_pool1d(x.reshape(1,1,-1), kernel_size=3, stride=1, padding=1)
print(x)
print(y)
print(y2)
print(y3)
tensor([4., 9., 3., 0., 3., 9., 7., 3., 7., 3.])
tensor([[[5.3333, 4.0000, 5.6667]]])
tensor([[[5.3333, 4.0000, 2.0000, 4.0000, 6.3333, 6.3333, 5.6667, 4.3333]]])
tensor([[[4.3333, 5.3333, 4.0000, 2.0000, 4.0000, 6.3333, 6.3333, 5.6667,
          4.3333, 3.3333]]])
  • Is average pooling a linear or non-linear operation?
  • Is max pooling a linear or non-linear operation?
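One way to probe these questions is to test additivity, \(\text{pool}(a+b) = \text{pool}(a) + \text{pool}(b)\), numerically (a quick sketch, not a proof):

```python
import torch
from torch.nn import functional as F

torch.manual_seed(0)
a = torch.randn(1, 1, 10)
b = torch.randn(1, 1, 10)

# Average pooling satisfies pool(a+b) = pool(a) + pool(b): it is linear
print(F.avg_pool1d(a + b, 3) - (F.avg_pool1d(a, 3) + F.avg_pool1d(b, 3)))
# Max pooling generally does not satisfy this: it is non-linear
print(F.max_pool1d(a + b, 3) - (F.max_pool1d(a, 3) + F.max_pool1d(b, 3)))
```

The first difference is zero (up to floating-point error); the second is generally non-zero.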

The shape of pooling layers is slightly different than for convolutions

x = torch.randn((3,4,10,20)).float()
print(x.shape, 'N x C x H x W')
y = F.max_pool2d(x, kernel_size=2)
print(y.shape, 'The number of channels does not change for pooling')
y2 = F.max_pool2d(x, kernel_size=2)
print(y2.shape, 'Note that `stride=kernel_size` by default')
y3 = F.max_pool2d(x, kernel_size=2, stride=1)
print(y3.shape, 'Can set stride explicitly to 1')
y4 = F.max_pool2d(x, kernel_size=3, stride=1, padding=1)
print(y4.shape, 'Can produce the same size')
torch.Size([3, 4, 10, 20]) N x C x H x W
torch.Size([3, 4, 5, 10]) The number of channels does not change for pooling
torch.Size([3, 4, 5, 10]) Note that `stride=kernel_size` by default
torch.Size([3, 4, 9, 19]) Can set stride explicitly to 1
torch.Size([3, 4, 10, 20]) Can produce the same size

Convolutional Neural Network (CNN) layers are compositions of convolution, activation, and pooling

import sklearn.datasets
A = torch.tensor(sklearn.datasets.load_sample_image('flower.jpg')).float()
A = torch.sum(A, dim=2)
filt = torch.tensor([[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]).float() # Horizontal
#filt = torch.tensor([[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]).float().t() # Vertical
#filt = torch.tensor([[1, -1], [-1, 1]]).float() # Checker board pattern
#filt = torch.ones((10, 10)).float() # Blur
print('Filter')
print(filt)
B = F.conv2d(A.reshape(1, 1, *A.size()), filt.reshape(1, 1, *filt.size()))
print('A size', A.size(), 'B size', B.size())
C = torch.relu(B)
D = torch.max_pool2d(C, kernel_size=20)
#D = torch.max_pool2d(C, kernel_size=20, stride=1)

fig, axes = plt.subplots(2, 2, figsize=(14,8))
axes = axes.ravel()
for im, ax in zip([A, B, C, D], axes):
    ax.imshow(im.squeeze(), cmap='gray')
Filter
tensor([[-1.,  0.,  1.],
        [-1.,  0.,  1.],
        [-1.,  0.,  1.]])
A size torch.Size([427, 640]) B size torch.Size([1, 1, 425, 638])

How could you detect an edge from multiple angles by combining convolutions and ReLUs?

  • Hint: First detect edges from all directions, then combine.
import sklearn.datasets
import torch
import numpy as np
A = torch.tensor(sklearn.datasets.load_sample_image('flower.jpg')).float()
#A = torch.tensor(sklearn.datasets.load_sample_image('china.jpg')).float() # Alternative sample image
A = torch.sum(A, dim=2)

filters = torch.tensor([
    [[[-1, 1], [-1, 1]]],
    [[[1, -1], [1, -1]]],
    [[[1, 1], [-1, -1]]],
    [[[-1, -1], [1, 1]]],
]).float()
B = F.conv2d(A.reshape(1, 1, *A.size()), filters)
C = torch.relu(B)

# Combine
filt = torch.ones(4).float()
D = F.conv2d(C, filt.reshape(1, 4, 1, 1))

fig, axes = plt.subplots(2, 3, figsize=(14,8))
for im, ax in zip([A, *C[0,:,:,:], D], axes.ravel()):
    ax.imshow(im.squeeze(), cmap='gray')

Check out PyTorch tutorial on simple classifier on CIFAR10 dataset:

https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html


Transposed Convolution Can Be Used to Upsample a Tensor/Image to Have Higher Dimensions

  • Also known as:
    • Fractionally-strided convolution
    • Improperly, deconvolution
  • Remember: Convolution is like matrix multiplication \[ y = x * f \iff \text{vec}(y) = A_f \text{vec}(x) \]
  • Transpose convolution is the transpose of \(A_f\): \[ \text{vec}(y) = A_f^T \text{vec}(x) \]
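This transpose relationship can be checked numerically, reusing the \([-1, 1]\) filter and the banded matrix \(A_f\) from the Jacobian demo above (sizes are arbitrary):

```python
import torch
from torch.nn import functional as F

# conv_transpose1d with the same filter applies A_f^T,
# where A_f is the matrix of the corresponding conv1d
filt = torch.tensor([-1.0, 1.0]).reshape(1, 1, 2)
z = torch.randn(1, 1, 9)
y = F.conv_transpose1d(z, filt).squeeze()  # length 10 (upsampled)

# Build A_f explicitly (same banded structure as the printed Jacobian)
A = torch.zeros(9, 10)
for i in range(9):
    A[i, i], A[i, i + 1] = -1.0, 1.0
y2 = torch.matmul(A.t(), z.squeeze())
print(torch.allclose(y, y2))  # True
```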

Convolution Operator With Corresponding Matrix

https://github.com/naokishibuya/deep-learning/blob/master/python/transposed_convolution.ipynb


Transposed Convolution Operator With Corresponding Matrix

https://github.com/naokishibuya/deep-learning/blob/master/python/transposed_convolution.ipynb


Transposed Convolution Can Be Equivalent to a Simple Convolution With Zero Rows/Columns Added

(added zeros simulate fractional strides)

Note

More modern upsampling layers upsample by interpolating non-zero values and then applying a convolution.
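A minimal sketch of such an interpolate-then-convolve block (the module name `UpsampleConv` and the layer sizes are illustrative, not from a specific library):

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

class UpsampleConv(nn.Module):
    """Upsample by interpolation, then convolve (alternative to transposed conv)."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, kernel_size=3, padding=1)

    def forward(self, x):
        x = F.interpolate(x, scale_factor=2, mode='nearest')  # double H and W
        return self.conv(x)  # 3x3 conv with padding=1 preserves H and W

x = torch.randn(1, 8, 16, 16)
print(UpsampleConv(8, 4)(x).shape)  # torch.Size([1, 4, 32, 32])
```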


Computing Tensor Shapes With Transpose Convolutions

  • Channels is computed the same as convolution
  • For spatial dimensions, you switch the input and output dimensions
    • Reason about the standard convolution dimensions
    • And then flip input and output dimensions
  • Like convolutions, output has same height and width as input
    • \(1 \times 1\) convolution with padding=0, stride=1
    • \(3 \times 3\) convolution with padding=1, stride=1
    • (A transposed convolution with stride 1 is equivalent to an ordinary convolution with stride 1)
  • Output has double (upsample) the height and width of input
    • \(2 \times 2\) convolution with padding=0, stride=2
    • \(4 \times 4\) convolution with padding=1, stride=2
    • \(6 \times 6\) convolution with padding=2, stride=2
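The doubling configurations can be checked directly with `F.conv_transpose2d`, using the transposed-convolution size formula \(H_{out} = (H_{in} - 1) S - 2P + K\):

```python
import torch
from torch.nn import functional as F

x = torch.randn(1, 1, 16, 16)
# Doubling configurations: H_out = (H_in - 1)*S - 2P + K
for k, p in [(2, 0), (4, 1), (6, 2)]:
    w = torch.randn(1, 1, k, k)
    y = F.conv_transpose2d(x, w, stride=2, padding=p)
    print(k, p, y.shape)  # always torch.Size([1, 1, 32, 32])
```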

Summary: Convolutional Neural Networks (CNNs)

  • Why CNNs?
    • Neuro-inspired: CNNs learn feature-detecting filters similar to those in the brain’s visual cortex.
    • Computationally Efficient: They are efficient due to sparse computation (local connections) and parameter sharing (the same filter is used across the input).
  • Core Components
    • Convolution Layer: Applies learnable filters (kernels) to input data to create feature maps. The output shape is controlled by kernel_size, stride, and padding.
    • Activation Function (e.g., ReLU): Introduces essential non-linearity, allowing the network to learn complex, non-linear patterns.
    • Pooling Layer (e.g., Max Pooling): Reduces the spatial dimensions (downsamples) of feature maps, which reduces computational load and provides local invariance.
    • Upsampling with Transposed Convolution: Used to increase the spatial dimensions.