Dealing with Boundary conditions / Halo regions in CUDA


Solution 1

A common approach to dealing with border effects is to pad the original image with extra rows and columns based on your filter size. Some common choices for the padded values are (a small index-mapping sketch follows the list):

  • A constant (e.g. zero)
  • Replicate the first and last row / column as many times as needed
  • Reflect the image at the borders (e.g. column[-1] = column[1], column[-2] = column[2])
  • Wrap the image values (e.g. column[-1] = column[width-1], column[-2] = column[width-2])
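
As a minimal sketch (not part of the original answer), these four choices can be expressed as device-side index-mapping helpers; the function names below (clamp_index, reflect_index, wrap_index, sample_or_constant) are hypothetical:

// Hypothetical helpers (names are mine): map an out-of-range coordinate i
// into the valid range [0, n) according to the chosen padding rule.
// They assume the filter radius is smaller than the image size.

__device__ int clamp_index(int i, int n)    // replicate the first / last row or column
{
    return min(max(i, 0), n - 1);
}

__device__ int reflect_index(int i, int n)  // reflect at the border: -1 -> 1, -2 -> 2
{
    if (i < 0)  i = -i;
    if (i >= n) i = 2 * n - 2 - i;
    return i;
}

__device__ int wrap_index(int i, int n)     // wrap around: -1 -> n - 1, -2 -> n - 2
{
    return (i % n + n) % n;
}

// Constant padding has no valid index to map to, so it is handled at the sampling site.
__device__ float sample_or_constant(const float * __restrict__ img, int x, int y,
                                    int width, int height, float border_value)
{
    if (x < 0 || x >= width || y < 0 || y >= height) return border_value;
    return img[y * width + x];
}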

Solution 2

tl;dr: It depends on the problem you're trying to solve -- there is no solution for this that applies to all problems. In fact, mathematically speaking, I suspect there may be no "solution" at all since I believe it's an ill-posed problem you're forced to deal with.

(Apologies in advance for my reckless abuse of mathematics)

To demonstrate, let's consider a situation where all pixel components and kernel values are assumed to be positive. To get an idea of how some of these answers could lead us astray, let's further think about a simple averaging ("box") filter. If we set values outside the boundary of the image to zero, then this will clearly drag down the average at every pixel within ceil(n/2) (Manhattan distance) of the boundary. So you'll get a "dark" border on your filtered image (assuming a single intensity component or RGB colorspace -- your results will vary by colorspace!). Note that similar arguments can be made if we set the values outside the boundary to any arbitrary constant -- the average will tend towards that constant. A constant of zero might be appropriate if the edges of your typical image tend towards 0 anyway. The same holds for more complex filter kernels like a Gaussian, although the problem will be less pronounced because the kernel values tend to decrease quickly with distance from the center.
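
As a concrete instance: centre a 3 x 3 box filter on a corner pixel of a uniform image with intensity V. Only 4 of the 9 taps land inside the image, so with zero padding the output is 4V/9 -- less than half of the true value.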

Now suppose that instead of using a constant we choose to repeat the edge values. This is the same as making a border around the image and copying rows, columns, or corners enough times to ensure the filter stays "inside" the new image. You could also think of it as clamping/saturating the sample coordinates. This causes problems with our simple box filter because it overemphasizes the values of the edge pixels. An edge pixel will appear more than once, yet every occurrence receives the same weight w = 1/(n*n). Suppose we sample an edge pixel with value K three times. Its contribution to the average is then:

K*w + K*w + K*w  = K*3*w

So effectively that one pixel has a higher weight in the average. Note that since this is an averaging filter, the weight is constant over the kernel. However, the argument also applies to kernels whose weights vary by position (again, think of the Gaussian kernel).
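
Concretely, for a 3 x 3 box filter centred on a corner pixel with clamped coordinates, the corner value is sampled 4 times and each of its two edge neighbours twice, so that single corner pixel carries an effective weight of 4/9 instead of 1/9.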

Suppose we wrap or reflect the sampling coordinates so that we're still using values from within the boundary of the image. This has some valuable advantages over using a constant but isn't necessarily "correct" either. For instance, how many photos do you take where the objects at the upper border are similar to those at the bottom? Unless you're taking pictures of mirror-smooth lakes I doubt this is true. If you're taking pictures of rocks to use as textures in games, wrapping or reflecting could be appropriate. I'm sure there are significant points to be made here about how wrapping and reflecting will likely reduce any artifacts that result from using a Fourier transform. However, this comes back to the same idea: that you have a periodic signal which you do not wish to distort by introducing spurious new frequencies or overestimating the amplitude of existing frequencies.

So what can you do if you're filtering photos of bright red rocks beneath a blue sky? Clearly you don't want to add orange-ish haze to the blue sky and blue-ish fuzz to the red rocks. Reflecting the sample coordinates works because we expect colors similar to those of the pixels found at the reflected coordinates... unless, just for the sake of argument, we imagine the filter kernel is so big that the reflected coordinate would extend past the horizon.

Let's go back to the box filter example. An alternative with this filter is to stop thinking about using a static kernel and think back to what this kernel was meant to do. An averaging/box filter is designed to sum the pixel components and then divide by the number of pixels summed. The idea is that this smooths out noise. If we're willing to trade reduced effectiveness in suppressing noise near the boundary, we can simply sum fewer pixels and divide by a correspondingly smaller number. This can be extended to filters with similar what-I-will-call-"normalizing" terms -- terms that are related to the area or volume of the filter. For "area" terms you count the number of kernel weights that are within the boundary and ignore those weights that are not, then use this count as the "area" (which might involve an extra multiplication). For volume (again: assuming positive weights!) simply sum the kernel weights that fall inside the boundary.

This idea is probably awful for derivative filters because there are fewer pixels to compete with the noisy pixels, and differentials are notoriously sensitive to noise. Also, some filters have been derived by numeric optimization and/or empirical data rather than from ab initio/analytic methods and thus may lack a readily apparent "normalizing" factor.
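
A minimal sketch of this renormalized box filter (the kernel name and signature below are my own, not from the answer): only taps that fall inside the image are accumulated, and the sum is divided by the number of taps actually used.

// Hypothetical renormalized box filter: average only the in-bounds taps and
// divide by how many were actually used, instead of always dividing by n*n.
__global__ void box_filter_renormalized(const float * __restrict__ in,
                                        float * __restrict__ out,
                                        const int width, const int height,
                                        const int radius)
{
    const int x = blockIdx.x * blockDim.x + threadIdx.x;
    const int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    float sum   = 0.f;
    int   count = 0;                       // the "area" term: taps inside the image

    for (int dy = -radius; dy <= radius; dy++)
        for (int dx = -radius; dx <= radius; dx++) {
            const int sx = x + dx;
            const int sy = y + dy;
            if (sx >= 0 && sx < width && sy >= 0 && sy < height) {
                sum += in[sy * width + sx];
                count++;
            }
        }

    out[y * width + x] = sum / (float)count;   // count >= 1, since (x, y) itself is inside
}

Launched with, e.g., a 16 x 16 thread block and a grid covering the whole image, interior pixels behave exactly like the usual box filter, while boundary pixels are averaged over their smaller valid neighbourhood.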

Solution 3

Your question is somewhat broad and I believe it mixes two problems:

  1. dealing with boundary conditions;
  2. dealing with halo regions.

The first problem (boundary conditions) is encountered, for example, when computing the convolution between an image and a 3 x 3 kernel. When the convolution window crosses the boundary, one has the problem of extending the image outside of its boundaries.

The second problem (halo regions) is encountered, for example, when loading a 16 x 16 tile within shared memory and one has to process the internal 14 x 14 tile to compute second order derivatives.

For the second issue, I think a useful question is the following: Analyzing memory access coalescing of my CUDA kernel.
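
To make the halo-region layout concrete, below is a minimal sketch (the tile size, kernel name, and the use of clamped coordinates at the image border are my own choices, not from the answer) in which each block loads a 16 x 16 shared-memory tile but only the inner 14 x 14 threads produce an output pixel, here for a 3 x 3 Laplacian stencil standing in for the second order derivatives.

#define TILE_W  16                      // tile edge loaded into shared memory (includes halo)
#define TILE_H  16
#define RADIUS   1                      // halo width for a 3 x 3 stencil

// Hypothetical 3 x 3 Laplacian using a 16 x 16 shared-memory tile: every thread
// loads one pixel (clamping the coordinates at the image border), but only the
// inner 14 x 14 threads of the block compute and write an output pixel.
__global__ void laplacian_shared(const float * __restrict__ in,
                                 float * __restrict__ out,
                                 const int width, const int height)
{
    __shared__ float tile[TILE_H][TILE_W];

    // Blocks advance by 14 (= TILE_W - 2 * RADIUS) output pixels; each thread's
    // load coordinate is shifted back by the halo width.
    const int gx = blockIdx.x * (TILE_W - 2 * RADIUS) + threadIdx.x - RADIUS;
    const int gy = blockIdx.y * (TILE_H - 2 * RADIUS) + threadIdx.y - RADIUS;

    // Clamp (replicate) at the image boundary -- one of the choices from Solution 1.
    const int cx = min(max(gx, 0), width  - 1);
    const int cy = min(max(gy, 0), height - 1);

    tile[threadIdx.y][threadIdx.x] = in[cy * width + cx];
    __syncthreads();

    // Only the inner 14 x 14 threads have a full neighbourhood in shared memory.
    if (threadIdx.x >= RADIUS && threadIdx.x < TILE_W - RADIUS &&
        threadIdx.y >= RADIUS && threadIdx.y < TILE_H - RADIUS &&
        gx >= 0 && gx < width && gy >= 0 && gy < height) {

        const float lap = tile[threadIdx.y][threadIdx.x - 1]
                        + tile[threadIdx.y][threadIdx.x + 1]
                        + tile[threadIdx.y - 1][threadIdx.x]
                        + tile[threadIdx.y + 1][threadIdx.x]
                        - 4.f * tile[threadIdx.y][threadIdx.x];

        out[gy * width + gx] = lap;
    }
}

The corresponding launch would use a 16 x 16 thread block and a grid of ceil(width / 14) x ceil(height / 14) blocks, so that consecutive tiles overlap by the halo.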

Concerning the extension of a signal outside of its boundaries, a useful tool in this case is texture memory, thanks to its different addressing modes; see The different addressing modes of CUDA textures.

Below, I'm providing an example of how a simple three-point moving-average filter can be implemented with periodic boundary conditions using texture memory.

#include <stdio.h>

#include "TimingGPU.cuh"
#include "Utilities.cuh"

texture<float, 1, cudaReadModeElementType> signal_texture;

#define BLOCKSIZE 32

/****************************************************************/
/* KERNEL FUNCTION: 3-POINT MOVING AVERAGE WITH PERIODIC WRAP   */
/****************************************************************/
__global__ void median_filter_periodic_boundary(float * __restrict__ d_vec, const unsigned int N){

    unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;

    if (tid < N) {

        // --- cudaAddressModeWrap is only supported with normalized coordinates,
        //     so sample at the texel centers mapped to [0, 1): (i + 0.5) / N
        float signal_center = tex1D(signal_texture, ((int)tid     + 0.5f) / (float)N);
        float signal_before = tex1D(signal_texture, ((int)tid - 1 + 0.5f) / (float)N);
        float signal_after  = tex1D(signal_texture, ((int)tid + 1 + 0.5f) / (float)N);

        printf("%i %f %f %f\n", tid, signal_before, signal_center, signal_after);

        d_vec[tid] = (signal_center + signal_before + signal_after) / 3.f;

    }
}


/********/
/* MAIN */
/********/
int main() {

    const int N = 10;

    // --- Input host array declaration and initialization
    float *h_arr = (float *)malloc(N * sizeof(float));
    for (int i = 0; i < N; i++) h_arr[i] = (float)i;

    // --- Output host and device array vectors
    float *h_vec = (float *)malloc(N * sizeof(float));
    float *d_vec;   gpuErrchk(cudaMalloc(&d_vec, N * sizeof(float)));

    // --- CUDA array declaration and texture memory binding; CUDA array initialization
    cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc<float>();
    //Alternatively
    //cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc(32, 0, 0, 0, cudaChannelFormatKindFloat);

    cudaArray *d_arr;   gpuErrchk(cudaMallocArray(&d_arr, &channelDesc, N, 1));
    gpuErrchk(cudaMemcpyToArray(d_arr, 0, 0, h_arr, N * sizeof(float), cudaMemcpyHostToDevice));

    // --- Wrap addressing is only supported with normalized texture coordinates
    signal_texture.normalized     = true;
    signal_texture.filterMode     = cudaFilterModePoint;
    signal_texture.addressMode[0] = cudaAddressModeWrap;
    gpuErrchk(cudaBindTextureToArray(signal_texture, d_arr));

    // --- Kernel execution
    median_filter_periodic_boundary<<<iDivUp(N, BLOCKSIZE), BLOCKSIZE>>>(d_vec, N);
    gpuErrchk(cudaPeekAtLastError());
    gpuErrchk(cudaDeviceSynchronize());

    gpuErrchk(cudaMemcpy(h_vec, d_vec, N * sizeof(float), cudaMemcpyDeviceToHost));

    for (int i=0; i<N; i++) printf("h_vec[%i] = %f\n", i, h_vec[i]);

    printf("Test finished\n");

    return 0;
}

Comments

  • pQB (almost 2 years ago)

    I'm working on image processing with CUDA and I have a doubt about pixel processing.

    What is often done with the boundary pixels of an image when applying an m x m convolution filter?

    In a 3 x 3 convolution kernel, ignoring the 1 pixel boundary of the image is easier to deal with, especially when the code is improved with shared memory. Indeed, in this case, one does not need to check whether a given pixel has all its neighbourhood available (e.g. the pixel at coord (0, 0) has no left, upper-left, or upper neighbours). However, removing the 1 pixel boundary of the original image could generate partial results.

    Opposite to that, I'd like to process all the pixels within the image, also when using shared memory improvements, i.e., for example, loading 16 x 16 pixels but computing the inner 14 x 14. Also in this case, ignoring the boundary pixels generates cleaner code.

    What is usually done in this case?

    Does anyone usually use my approach of ignoring the boundary pixels?

    Of course, I'm aware the answer depends on the type of problem, i.e. adding two images pixel-wise does not have this problem.

    Thanks in advance.

  • pQB (about 13 years ago)
    I see. It requires extra effort in any case, like a pre-processing step to pad the image (which requires reallocating memory). It is probably more efficient to simulate the extra padding while loading the data into an 'extended' shared memory tile (i.e. load 18x18 pixels to compute 16x16). I'm working with this approach at the moment. Thanks for your suggestions.
  • fabrizioM (about 13 years ago)
    pQB, if you use texture memory the GPU will provide and manage these boundary conditions for you; plus, texture memory is faster.
  • pQB (about 13 years ago)
    That's right, fabrizioM. That is the case of clamping the data at the boundary. But in other types of problems I could also store the results and reuse them in another kernel call; in that case I could not use texture memory to store the results. I'd read something about 'surface' memory but never used it. (The convolution was an example to create a context for the question :). Thanks for your hint.
  • Trass3r (over 10 years ago)
    Is there a summary for the properties of the different choices?