Getting the output shape of deconvolution layer using tf.nn.conv2d_transpose in tensorflow


Solution 1

for deconvolution,

output_size = strides * (input_size-1) + kernel_size - 2*padding

strides, input_size, kernel_size, and padding are integers; padding is zero for 'VALID'.
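As a sanity check, the formula can be sketched in plain Python. The function name and the `pad_total` parameterization are mine, not TensorFlow's: `pad_total` is the sum of the padding before and after (the answer's `2*padding` assumes a symmetric split, but TensorFlow may split it asymmetrically for 'SAME'):

```python
def deconv_output_size(input_size, kernel_size, stride, pad_total=0):
    # Transposed convolution (deconvolution) output size:
    # output = stride * (input - 1) + kernel - (pad_before + pad_after)
    return stride * (input_size - 1) + kernel_size - pad_total

# 'VALID' (no padding): 6 -> 13, matching the second test case in Solution 3
assert deconv_output_size(6, 3, 2, 0) == 13
# 'SAME' (total padding 1, e.g. pad_top=0, pad_bottom=1): 6 -> 12 = input * stride
assert deconv_output_size(6, 3, 2, 1) == 12
```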

Solution 2

The formula for the output size from the tutorial assumes that the padding P is the same before and after the image (left & right, or top & bottom). Then the number of positions where you can place the kernel is W (size of the image) − F (size of the kernel) + P (additional padding before) + P (additional padding after) + 1, which gives the +1 in the formula.

But tensorflow also handles the situation where you need to pad more pixels to one of the sides than to the other, so that the kernels would fit correctly. You can read more about the strategies to choose the padding ("SAME" and "VALID") in the docs. The test you're talking about uses method "VALID".
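For reference, the forward conv2d output sizes that the two strategies produce (as stated in the TensorFlow padding documentation) can be sketched as follows; the function name is my own:

```python
import math

def conv_output_size(in_size, filter_size, stride, padding):
    # Output-size formulas from the TensorFlow "SAME"/"VALID" padding docs
    if padding == 'SAME':
        return math.ceil(in_size / stride)
    elif padding == 'VALID':
        return math.ceil((in_size - filter_size + 1) / stride)
    raise ValueError(padding)

# Running conv2d over the deconv outputs from Solution 3 recovers the 6-pixel input:
assert conv_output_size(12, 3, 2, 'SAME') == 6
assert conv_output_size(13, 3, 2, 'VALID') == 6
```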

Solution 3

This discussion is really helpful; let me just add some additional information. With padding='SAME', the bottom and right sides may receive one extra padded pixel. According to the TensorFlow documentation, the test case below

strides = [1, 2, 2, 1]
# Input, output: [batch, height, width, depth]
x_shape = [2, 6, 4, 3]
y_shape = [2, 12, 8, 2]

# Filter: [kernel_height, kernel_width, output_depth, input_depth]
f_shape = [3, 3, 2, 3]

is using padding='SAME'. We can interpret padding='SAME' as:

(H − F + pad_along_height)/S + 1 = out_height,
(W − F + pad_along_width)/S + 1 = out_width.

So (12 − 3 + pad_along_height)/2 + 1 = 6 gives pad_along_height = 1. Then pad_top = pad_along_height/2 = 1/2 = 0 (integer division), and pad_bottom = pad_along_height − pad_top = 1.
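That arithmetic, written out in Python (solving the SAME equation for the padding, then splitting it the way TensorFlow does, with the extra pixel going to the bottom/right):

```python
# Solve (H - F + pad_along_height) / S + 1 = out_height for pad_along_height
H, F, S, out_height = 12, 3, 2, 6
pad_along_height = (out_height - 1) * S + F - H
pad_top = pad_along_height // 2          # integer division: the smaller half
pad_bottom = pad_along_height - pad_top  # extra pixel lands on the bottom
assert (pad_along_height, pad_top, pad_bottom) == (1, 0, 1)
```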

As for padding='VALID', as the name suggests, we pad only when it is necessary: we first assume no padding, and if the kernel would read values outside the original input image region, those positions are treated as zero padding. For example, consider the test case below:

strides = [1, 2, 2, 1]

# Input, output: [batch, height, width, depth]
x_shape = [2, 6, 4, 3]
y_shape = [2, 13, 9, 2]

# Filter: [kernel_height, kernel_width, output_depth, input_depth]
f_shape = [3, 3, 2, 3]

The output shape of conv2d is

out_height = ceil(float(in_height - filter_height + 1) / float(strides[1]))
           = ceil(float(13 - 3 + 1) / float(2)) = ceil(11/2) = 6
           = (W−F)/S + 1.

Because (W−F)/S + 1 = (13−3)/2 + 1 = 6 and the result is an integer, we don't need to add zero pixels around the border of the image, so pad_top = 1/2 and pad_left = 1/2 in the TensorFlow documentation's padding='VALID' section are both 0.
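The same check in Python, for the VALID test case above:

```python
import math

W, F, S = 13, 3, 2
out = math.ceil((W - F + 1) / S)  # VALID output-size formula from the TF docs
assert out == 6
# (W - F) is exactly divisible by S, so the kernel never reads outside
# the input and no zero padding is needed:
assert (W - F) % S == 0
```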

Author: Xiuyi Yang

Graduate of Dalian University of Technology (DUT, 大连理工大学), majoring in engineering. My current interests are algebraic geometry and how deep neural networks work.

Updated on June 10, 2022

Comments

  • Xiuyi Yang, almost 2 years ago

    According to this paper, the output shape is N + H − 1, where N is the input height or width and H is the kernel height or width. This is obviously the inverse process of convolution. This tutorial gives a formula for the output shape of a convolution, (W−F+2P)/S + 1, where W is the input size, F the filter size, P the padding size, and S the stride. But in TensorFlow, there are test cases like:

      strides = [1, 2, 2, 1]
    
      # Input, output: [batch, height, width, depth]
      x_shape = [2, 6, 4, 3]
      y_shape = [2, 12, 8, 2]
    
      # Filter: [kernel_height, kernel_width, output_depth, input_depth]
      f_shape = [3, 3, 2, 3]
    

    So, using y_shape, f_shape and x_shape with the formula (W−F+2P)/S + 1, we can solve for the padding size P. From (12 − 3 + 2P)/2 + 1 = 6 we get P = 0.5, which is not an integer. How does deconvolution work in TensorFlow?
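    The arithmetic in the question, written out (it shows that a single symmetric padding P cannot be an integer here, which is exactly why TensorFlow splits the padding asymmetrically):

    ```python
    # (W - F + 2P) / S + 1 = out  =>  P = ((out - 1) * S - W + F) / 2
    W, F, S, out = 12, 3, 2, 6
    P = ((out - 1) * S - W + F) / 2
    assert P == 0.5  # not an integer, so no symmetric padding works
    ```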