Hello Everyone,

I’m new to Deep Learning and TensorFlow. From studying tutorials / research papers / online lectures it appears that people always use the execution order: ReLU -> Pooling. But in the case of e.g. 2×2 max-pooling, it seems that we can save 75% of the ReLU operations by simply reversing the execution order to: Max-Pooling -> ReLU. This should calculate the exact same thing using only a quarter of the ReLU operations. This reversal of operations works in general for max-pooling combined with any non-decreasing activation function (which I guess they all are?), but it won’t work for average-pooling.
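To convince myself of the equivalence, here is a minimal NumPy sketch (plain arrays instead of TensorFlow ops, and the helper names are just made up for illustration) showing that the two execution orders produce identical results for 2×2 max-pooling:

```python
import numpy as np

def relu(x):
    # ReLU is non-decreasing, which is what makes the swap valid.
    return np.maximum(x, 0.0)

def max_pool_2x2(x):
    # Non-overlapping 2x2 max-pooling of a (H, W) array with even H, W.
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))

a = max_pool_2x2(relu(x))  # usual order: ReLU applied to all 64 elements
b = relu(max_pool_2x2(x))  # reversed order: ReLU applied to only 16 elements

print(np.array_equal(a, b))  # True: max(relu(window)) == relu(max(window))
```

The equality holds exactly (not just approximately) because for any non-decreasing function f, max(f(v1), ..., f(vn)) == f(max(v1, ..., vn)).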

This is an optimization that TensorFlow could perform automatically when compiling the computation graph. I haven’t quite figured out how to use TensorBoard yet, so I can’t tell if this automatic reversal of ReLU and max-pooling is already being done in TensorFlow.

So I’ve done a few experiments instead, timing the training of a convolutional net on MNIST, but the timing results for the two execution orders are inconclusive. Perhaps this means that TensorFlow already does the reversal automatically; perhaps there’s no consistent advantage because the saved ReLU operations are such a tiny fraction of the overall computational cost; or perhaps it takes much larger images than MNIST and much deeper convolutional networks for the performance difference to become apparent.

Any thoughts?

Such code optimizations are common in e.g. C / C++ compilers. It has probably been 15 years since I’ve implemented a compiler, but as I recall, these kinds of optimizations are called Peephole Optimizations. TensorFlow is a kind of compiler which uses a computational graph, so it would make sense for it to do automatic code optimizations like this.

Regarding the potential time savings, I think more investigation is needed before we rule it out. Let’s do a quick back-of-the-envelope calculation to begin with. This may seem a little confusing and I hope I got the numbers right as I’m still new to these things.

The convolution performs approximately O(input_width * input_height * input_channels * filter_width * filter_height * num_filters) operations for each input image. This results in a tensor having approximately input_width * input_height * num_filters elements, depending on the padding settings. Dividing the two numbers shows that O(filter_width * filter_height * input_channels) arithmetic operations are performed to calculate one element of the output tensor. If we assume a ReLU operation takes the same time to execute as each of those convolutional operations, then the cost of one single ReLU operation is approximately 1 / (filter_width * filter_height * input_channels) of the total computational cost of the convolutional layer. Since we can save 75% of the ReLU operations simply by switching the order of the ReLU and the 2×2 Max-Pooling, we should expect to save approximately 0.75 * (1 / (filter_width * filter_height * input_channels)) of the overall computational cost.

For example, with filter_width == filter_height == 5 and input_channels == 1 we get 0.75 * 0.04 = 0.03, that is, approximately 3% of the overall computational cost of the convolutional layer would be saved by this simple reversal of ReLU and Max-Pooling. That’s quite a nice saving for such a simple code optimization! However, if input_channels == 64 then the saving is only 0.75 * (1 / 1600), which is about 0.0005 or about 0.05%, which is clearly insignificant.
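The back-of-the-envelope numbers above can be reproduced with a tiny helper (the function name and signature are just for illustration, not anything from TensorFlow):

```python
def relu_saving_fraction(filter_w, filter_h, in_channels, pool_size=2):
    """Approximate fraction of a conv layer's cost saved by applying
    ReLU after pool_size x pool_size max-pooling instead of before."""
    # Multiply-adds needed per conv output element.
    ops_per_output = filter_w * filter_h * in_channels
    # One ReLU per output element, assumed to cost one operation.
    relu_cost_fraction = 1.0 / ops_per_output
    # Fraction of ReLU ops eliminated by pooling first: 0.75 for 2x2.
    saved_relu_ops = 1.0 - 1.0 / pool_size**2
    return saved_relu_ops * relu_cost_fraction

print(relu_saving_fraction(5, 5, 1))   # about 0.03, i.e. ~3%
print(relu_saving_fraction(5, 5, 64))  # about 0.00047, i.e. ~0.05%
```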

But if a more expensive activation function is used instead of ReLU, e.g. something involving floating-point division, which is costly, then perhaps it would still make sense to do this optimization even when the number of input channels is high.

Another thing to consider is the number of layers in the network. The number of input channels to a convolutional layer is low in the first layer (e.g. 1 channel for gray-scale images and 3 channels for RGB colours), and becomes higher in later layers because of the higher number of filter channels. So the first layer might actually provide a small but tangible saving to the overall computational cost of the network simply by switching the order of ReLU and Max-Pooling, while the deeper layers may only provide a tiny and insignificant saving. But if this optimization were done transparently by the TensorFlow compiler, then any potential time-saving would be gratis to the user of TensorFlow.
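To illustrate the layer-depth argument, here is a quick sketch with hypothetical layer sizes (the channel counts are made up, roughly resembling a small MNIST-style net):

```python
# Hypothetical layer configs: (name, filter_w, filter_h, input_channels).
layers = [
    ("conv1", 5, 5, 1),   # first layer: 1 gray-scale input channel
    ("conv2", 5, 5, 16),  # deeper layer: 16 input channels
    ("conv3", 3, 3, 64),  # even deeper: 64 input channels
]

savings = {}
for name, fw, fh, c in layers:
    # Fraction of this layer's conv cost saved by pooling before ReLU.
    savings[name] = 0.75 / (fw * fh * c)
    print(f"{name}: ~{savings[name]:.3%} of this layer's cost saved")
```

The first layer dominates the saving, and it shrinks rapidly with depth as the channel count grows.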

Nevertheless, I find it curious that people in Deep Learning continue to use the ReLU -> Max-Pooling ordering, apparently without realizing that it wastes operations. It suggests that people in the field might have a slightly rigid way of thinking about these things, and perhaps there are more substantial improvements waiting to be discovered.

@Hvass-Labs Your explanation of why the cost saving is insignificant is very nice and correct, yet of course it is interesting to wonder if it would make sense to reverse the layer order anyway. IMHO I would expect industry or embedded-level networks to be optimized in such a way when possible, as any “free” computation saving is beneficial.

However, I wanted to point out that some networks do need the ReLU to be performed right after the convolution. In semantic segmentation, for example, recurrent CNNs use the “convolution output” (including the ReLU) to feed later stages of the architecture where an upsampling is required, and the original activations are needed at their original size to recover the exact neuron which fired the activation. If the MaxPool were done before the ReLU, this detailed local information would be lost, and it therefore wouldn’t be possible to recover an output as big (or almost as big) as the original image.

I know it is not a big deal, and of course this only involves a small subset of CNNs, but hopefully I satisfied a bit of your curiosity! 🙂