How fast is my model?

by Matthijs Hollemans
30 June 2018


When doing deep learning on mobile devices, how good your model’s predictions are isn’t the only consideration. You also need to worry about things such as how fast the model runs on the device, how much memory it uses, and how much battery power it drains.

Authors of academic papers usually don’t worry about these things. They can run their models on fat desktop GPUs or compute clusters. But if you’re interested in converting such a model to run on mobile, you’ll need to have some idea of how fast the model will be on the target device and how much battery power it uses.

The best way to measure the speed of a model is to run it a number of times in a row and take the average elapsed time. The time you measure for any single run may have a fairly large margin of error — the CPU or GPU may be busy doing other tasks (drawing the screen, for example) — but when you average over multiple runs this will significantly shrink that error.
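To make the idea concrete, here is a minimal sketch of such a timing loop in Python. The function and parameter names are my own, and predict is just a placeholder for whatever runs your model once:

import time

def average_inference_time(predict, num_runs=100, warmup=5):
    # Warm-up runs: the first few predictions tend to be slower
    # (allocating memory, compiling shaders, filling caches).
    for _ in range(warmup):
        predict()

    start = time.perf_counter()
    for _ in range(num_runs):
        predict()
    elapsed = time.perf_counter() - start

    # Averaging over many runs shrinks the error caused by the
    # CPU or GPU being busy with other tasks during any single run.
    return elapsed / num_runs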

Of course this assumes you already have a model that runs on the device!

It’s quite useful to have some theoretical insight into how well your model will be doing before you start training it, since training is expensive.

Case study: One of my clients recently replaced the MobileNetV1 layers in their model with MobileNetV2 layers. V2 uses a lot fewer computations than V1 so you’d think this change would make the model a lot faster (there were many additional layers in the model but these did not change).

For their V2 layers they used a depth multiplier of 1.4, which adds more filters to each layer, but this still resulted in a network with fewer parameters than before. Even so, I had a hunch that the V2 layers in their particular configuration wouldn’t be much faster than the original V1 layers.

It turns out my hunch was right — the V2 model was in fact slower! In this blog post I’ll show why that is and how to do this kind of math.

Computations

One way to get an idea of the speed of your model is to simply count how many computations it does. We typically express this in FLOPS, floating point operations per second (although when we tally up a model’s total work we’re really counting the operations themselves, not a rate). A slight variation of this is MACCs or multiply-accumulate operations, also known as MADDs.

Note: Before continuing, I have to point out that tallying up the number of calculations by itself isn’t going to tell you everything you need to know. Counting the number of computations is useful only to get a very rough idea of what the computational cost of your model is, but other factors such as memory bandwidth are often more important (we’ll go into this later on).

It’s dot products all the way down

Why multiply-accumulate? Many of the computations in neural networks are dot products, such as this:

y = w[0]*x[0] + w[1]*x[1] + w[2]*x[2] + ... + w[n-1]*x[n-1]

Here, w and x are two vectors, and the result y is a scalar (a single number).

In the case of a convolutional layer or a fully-connected layer — the two main types of layers in modern neural networks — w would be the layer’s learned weights and x would be the input to that layer.

y is one of the layer’s outputs. Typically a layer will have multiple outputs, and so we compute many of these dot products.

We count w[0]*x[0] + ... as one multiply-accumulate or 1 MACC.

The “accumulation” operation here is addition, as we sum up the results of all the multiplications. The above formula has n of these MACCs.

Therefore, a dot product between two vectors of size n uses n MACCs.

Note: Technically speaking there are only n - 1 additions in the above formula, one less than the number of multiplications. Think of the number of MACCs as being an approximation, just like Big-O notation is an approximation of the complexity of an algorithm.

In terms of FLOPS, a dot product performs 2n - 1 FLOPS since there are n multiplications and n - 1 additions.

So a MACC is roughly two FLOPS, although multiply-accumulates are so common that a lot of hardware can do fused multiply-add operations where the MACC is a single instruction. 🤷‍♂️
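In code, such a dot product is just a loop of multiply-accumulates, one per element, which is why we count it as n MACCs. A quick sketch:

def dot_product(w, x):
    # n multiplications and (roughly) n additions: n MACCs, about 2n FLOPS
    y = 0.0
    for w_i, x_i in zip(w, x):
        y += w_i * x_i      # one multiply-accumulate
    return y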

Now let’s look at a few different layer types to see how to compute the number of MACCs for these layers.

Fully-connected layer

In a fully-connected layer, all the inputs are connected to all the outputs. For a layer with I input values and J output values, its weights W can be stored in an I × J matrix. The computation performed by a fully-connected layer is:

y = matmul(x, W) + b

Here, x is a vector of I input values, W is the I × J matrix containing the layer’s weights, and b is a vector of J bias values that get added as well. The result y contains the output values computed by the layer and is also a vector of size J.

To compute the number of MACCs, we look at where the dot products happen. For a fully-connected layer, that is the matrix multiplication matmul(x, W).

A matrix multiply is simply a whole bunch of dot products. Each dot product is between the input x and one column in the matrix W. Both have I elements and therefore this counts as I MACCs. We have to compute J of these dot products, and so the total number of MACCs is I × J, the same size as the weight matrix.

The bias b doesn’t really affect the number of MACCs. Recall that a dot product has one less addition than multiplication anyway, so adding this bias value simply gets absorbed in that final multiply-accumulate.

Example: a fully-connected layer with 300 input neurons and 100 output neurons performs 300 × 100 = 30,000 MACCs.

Note: Sometimes the formula for the fully-connected layer is written without an explicit bias value. In that case, the bias vector is added as a row to the weight matrix to make it (I + 1) × J, but that’s really more of a mathematical simplification — I don’t think the operation is ever implemented like that in real software. In any case, it would only add J extra multiplications, so the number of MACCs wouldn’t be greatly affected anyway. Remember it’s an approximation.

In general, multiplying a vector of length I with an I × J matrix to get a vector of length J, takes I × J MACCs or (2I - 1) × J FLOPS.

If the fully-connected layer directly follows a convolutional layer, its input size may not be specified as a single vector length I but perhaps as a feature map with a shape such as (512, 7, 7). Some packages like Keras require you to “flatten” this input into a vector first, so that I = 512×7×7. But the math doesn’t change.
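Wrapped into a tiny helper of my own (not any library’s API), the count looks like this:

def fc_maccs(num_inputs, num_outputs):
    # One dot product of length I per output neuron, so I x J MACCs in total
    # (the bias is absorbed into the final multiply-accumulate).
    return num_inputs * num_outputs

print(fc_maccs(300, 100))            # 30,000 MACCs, the example above
print(fc_maccs(512 * 7 * 7, 4096))   # a flattened (512, 7, 7) feature map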

Note: In all these calculations I’m assuming a batch size of 1. If you want to know the number of MACCs for a larger batch size B, then simply multiply the result by B.

Activation functions

Usually a layer is followed by a non-linear activation function, such as a ReLU or a sigmoid. Naturally, it takes time to compute these activation functions. We don’t measure these in MACCs but in FLOPS, because they’re not dot products.

Some activation functions are more difficult to compute than others. For example, a ReLU is just:

y = max(x, 0)

This is a single operation on the GPU. The activation function is only applied to the output of the layer. On a fully-connected layer with J output neurons, the ReLU uses J of these computations, so let’s call this J FLOPS.

A sigmoid activation is more costly, since it involves taking an exponent:

y = 1 / (1 + exp(-x))

When calculating FLOPS we usually count addition, subtraction, multiplication, division, exponentiation, square root, etc as a single FLOP. Since there are four distinct operations in the sigmoid function, this would count as 4 FLOPS per output or J × 4 FLOPS for the total layer output.

It’s actually common to not count these operations, as they only take up a small fraction of the overall time. We’re mostly interested in the (big) matrix multiplies and dot products, and we’ll simply assume that the activation function is free.

In conclusion: activation functions, don’t worry about them.

Convolutional layer

The input and output to convolutional layers are not vectors but three-dimensional feature maps of size H × W × C where H is the height of the feature map, W the width, and C the number of channels at each location.

Most convolutional layers used today have square kernels. For a conv layer with kernel size K, the number of MACCs is:

K × K × Cin × Hout × Wout × Cout

Here’s where that formula comes from:

  1. each output value in the feature map is the dot product of the filter’s K × K × Cin weights with a K × K × Cin block of input values, which costs K × K × Cin MACCs per output value
  2. the output feature map has Hout × Wout × Cout of these output values

Again, we’re conveniently ignoring the bias and the activation function here.

Something we should not ignore is the stride of the layer, as well as any dilation factors, padding, etc. That’s why we look at the dimensions of the layer’s output feature map, Hout × Wout, since that already has the stride etc accounted for.

Example: for a 3×3 convolution with 128 filters, on a 112×112 input feature map with 64 channels, we perform this many MACCs:

3 × 3 × 64 × 112 × 112 × 128 = 924,844,032

That’s almost 1 billion multiply-accumulate operations! Gotta keep that GPU busy…

Note: In this example, we used “same” padding and stride = 1, so that the output feature map has the same size as the input feature map. It’s also common to see convolutional layers use stride = 2, which would have chopped the output feature map size in half, and we would’ve used 56 × 56 instead of 112 × 112 in the above calculation.
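Here is that formula as a small helper function (again a sketch of my own, so the names are made up):

def conv_maccs(kernel_size, c_in, h_out, w_out, c_out):
    # Each output value is a dot product over a K x K x Cin window of the
    # input, and there are Hout x Wout x Cout output values.
    return kernel_size * kernel_size * c_in * h_out * w_out * c_out

print(conv_maccs(3, 64, 112, 112, 128))  # 924,844,032, the example above
print(conv_maccs(3, 64, 56, 56, 128))    # the same layer but with stride = 2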

Depthwise-separable convolution

A depthwise-separable convolution is a factorization of a regular convolution into two smaller operations. Together they take up a lot less memory (fewer weights) and are much faster. Naturally, this only approximates what a “full” convolutional layer can do, so you may actually need more of these to get the same expressive power in your model, but even then you still come out ahead.

These kinds of layers work very well on mobile devices and are the foundation of MobileNet, but also of larger models such as Xception.

The first operation is the depthwise convolution. This is in many ways similar to a regular convolution, except that we don’t combine the input channels. There are always the same number of output channels as input channels.

The total number of MACCs for a depthwise convolution is:

K × K × C × Hout × Wout

This does a factor of C less work, making this a lot more efficient than a regular convolutional layer.

Example: a 3×3 depthwise convolution on a 112×112 feature map with 64 input channels performs this many MACCs:

3 × 3 × 64 × 112 × 112 = 7,225,344

Note that this convolution always has exactly as many filters as there are input channels, and each filter only works on a single channel. That’s why there is no factor × 128 in the above computation.

Note: There is something called the “depthwise channel multiplier”. If this multiplier is greater than 1, there are D output channels for every input channel. So instead of having a single filter per channel, you’d now have D filters per channel. But I haven’t seen this used much in practice.

Just the depthwise convolution alone is not enough, we also need to add the “separable” bit. This second operation is a regular convolution but always using kernel size 1×1, also called a “pointwise” convolution.

For this pointwise convolution layer, the number of MACCs is:

Cin × Hout × Wout × Cout

since K = 1.

Example: let’s take the output from the depthwise convolution, which had a 112×112×64 feature map, and project this into 128 dimensions to create a new 112×112×128 feature map. That would use this many MACCs:

64 × 112 × 112 × 128 = 102,760,448

As you can see, the pointwise operation is many times more expensive than the depthwise one. If we put them together, though, the total number of MACCs is much less than with the regular 3×3 convolution:

3×3 depthwise          : 7,225,344
1×1 pointwise          : 102,760,448
depthwise separable    : 109,985,792 MACCs

regular 3×3 convolution: 924,844,032 MACCs

That’s about 8.4 times fewer computations!

Now, it’s a little unfair to compare these two kinds of layers, since the regular 3×3 convolution is more expressive: it can compute more interesting things. But for the same cost you can use 8 times more of these depthwise-separable layers, or give them more filters, and that will give the regular convolution a run for its money.

Just for completeness’ sake, the total MACCs for a depthwise-separable layer is:

(K × K × Cin × Hout × Wout) + (Cin × Hout × Wout × Cout)

which simplifies to:

Cin × Hout × Wout × (K × K + Cout)

If you compare this to the formula for a regular convolution layer, you’ll notice that the only difference is that originally we did × Cout while here it is + Cout. Doing addition instead of multiplication makes a big difference…

As a quick rule of thumb, using a depthwise-separable layer is almost a factor of K×K less costly than a regular conv layer. In the above example, it was a factor of 8.4, which indeed is almost the same as K × K = 3 × 3 = 9.

Note: The exact factor is: K × K × Cout / (K × K + Cout).
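In code, the comparison from this section looks like this (same caveats as before, it’s my own sketch):

def depthwise_separable_maccs(kernel_size, c_in, h_out, w_out, c_out):
    depthwise = kernel_size * kernel_size * c_in * h_out * w_out
    pointwise = c_in * h_out * w_out * c_out
    return depthwise + pointwise

regular   = 3 * 3 * 64 * 112 * 112 * 128                      # 924,844,032
separable = depthwise_separable_maccs(3, 64, 112, 112, 128)   # 109,985,792
print(regular / separable)   # about 8.4, close to K x K = 9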

I should also point out that depthwise convolutions sometimes have a stride > 1, which reduces the dimensions of their output feature map. But a pointwise layer usually has stride = 1, and so its output feature map will always have the same dimensions as the depthwise layer’s.

Depthwise-separable layers are the main building block in MobileNet V1. However, MobileNet V2 shakes things up a little and uses an “expansion block” consisting of three layers:

  1. a 1×1 convolution that adds more channels to the feature map (known as the “expansion” layer)
  2. a 3×3 depthwise convolution that filters the data
  3. a 1×1 convolution that reduces the number of channels again (the “projection” layer, which acts as a bottleneck convolution)

Just for the sake of completeness, here is the formula for the number of MACCs in such an expansion block:

Cexp = (Cin × expansion_factor)

expansion_layer = Cin × Hin × Win × Cexp

depthwise_layer = K × K × Cexp × Hout × Wout

projection_layer = Cexp × Hout × Wout × Cout

These are the same formulas I gave earlier. expansion_factor is used to create the extra channels for the depthwise layer to work on, making Cexp the number of channels used inside this block.

Note: The output dimensions of the expansion layer are Hin × Win because a 1×1 convolution doesn’t change the feature map width and height. But if the depthwise layer has stride > 1, then Hout × Wout will be different from Hin × Win.

Putting all of this together:

Cin × Hin × Win × Cexp + (K × K + Cout) × Cexp × Hout × Wout

And if the stride = 1, it simplifies to:

(K × K + Cout + Cin) × Cexp × Hout × Wout

How does this compare to a depthwise-separable layer used by V1? If we use 112×112×64 as the input feature map, an expansion factor of 6, and a 3×3 depthwise convolution with stride = 1, and 128 output channels, then the total number of MACCs is:

(3 × 3 + 128 + 64) × (64 × 6) × 112 × 112 = 968,196,096

Isn’t that actually a lot more than before? Yep, it’s even more than the original 3×3 convolution. However… note that due to the expansion layer, inside this block we actually work with 64 × 6 = 384 channels. So this group of layers does a lot more than the original 3×3 convolution did (which “only” went from 64 to 128 channels), at roughly the same computational cost.
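For completeness, here is the expansion block as a sketch, assuming stride = 1 so that the width and height don’t change:

def expansion_block_maccs(kernel_size, c_in, h, w, c_out, expansion_factor):
    c_exp = c_in * expansion_factor
    expansion  = c_in * h * w * c_exp                    # 1x1 expansion layer
    depthwise  = kernel_size * kernel_size * c_exp * h * w
    projection = c_exp * h * w * c_out                   # 1x1 projection layer
    return expansion + depthwise + projection

print(expansion_block_maccs(3, 64, 112, 112, 128, 6))    # 968,196,096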

Batch normalization

I’ve mentioned that we don’t really count activation functions but what about batch normalization? In modern networks it’s common to include a batchnorm layer after every convolutional layer.

Batch normalization takes the output of a layer and applies the following formula to every single output value:

z = gamma * (y - mean) / sqrt(variance + epsilon) + beta

Here, y is an element in the output feature map from the previous layer. We first normalize this value by subtracting the mean for that output channel and dividing by the standard deviation (epsilon is used to make sure we don’t divide by 0 and usually is something like 0.001). Then we scale by some factor gamma and add a bias or offset beta.

Each channel has its own gamma, beta, mean, and variance values, so if there are C channels in the convolution layer’s output, then the batch normalization layer has learned C×4 parameters.

It seems like this would be quite a few FLOPS, since the above formula is applied to every single element in the output feature map.

However… often the batch normalization is applied to the output of a convolution layer but before the non-linearity (ReLU). In that case, we can do a bit of math to make the batch norm layer disappear!

Since the convolution (or the matrix multiplication done in a fully-connected layer) is just a bunch of dot products, which is a linear transformation, and the batch norm formula given above is also a linear transformation, we can combine these two formulas into a single transformation.

In other words, we can “fold” the batch norm layer’s learned parameters into the weights of the preceding convolution / fully-connected layer.

The math for folding the batch norm parameters into the weights of the preceding layers is fairly straightforward. In the above formula, y meant a single output value from the previous layer. Let’s expand y into the calculation it came from:

z = gamma * ((x[0]*w[0] + x[1]*w[1] + ... + x[n-1]*w[n-1] + b) - mean) 
      / sqrt(variance + epsilon) + beta

This is again a dot product, either from the convolution kernel or from a matrix multiply. As usual, x is the input data, w are the weights for that layer, and b is the layer’s bias value.

To fold the batch norm parameters into the previous layer, we want to rewrite this equation so that gamma, beta, mean, and variance only apply to w and b but don’t have x in them. After a bit of fiddling with the formulas this gives us:

w_new[i] = w[i]       * gamma / sqrt(variance + epsilon)
b_new    = (b - mean) * gamma / sqrt(variance + epsilon) + beta

Here, w_new[i] is the new value for the i-th weight, and b_new is the new value for the layer’s bias.

From now on, we can use these values as the weights of the convolutional or fully-connected layer and write,

z = x[0]*w_new[0] + x[1]*w_new[1] + ... + x[n-1]*w_new[n-1] + b_new

and this gives the exact same result as before, but without having to use the batch norm layer.

If you don’t believe me, substitute the values of w_new and b_new in the above formula and simplify. You should get the original batch norm formula again. 😀

Note that layers that are immediately followed by batchnorm often don’t have a bias b themselves, since the batchnorm layer already provides one (beta). In that case the formula for b_new becomes a little simpler (we set b to 0):

b_new = beta - mean * gamma / sqrt(variance + epsilon)

So even if the original layer does not have a bias, it will get one anyway courtesy of the folded batch norm layer.

Long story short: we can totally ignore the influence of the batch norm layer, since we actually remove it from the model when doing inference.

Note: This trick only works when the order of the layers is: convolution, batch norm, ReLU — but not when it is: convolution, ReLU, batch norm. The ReLU is a non-linear operation, which messes up the math. (Although I suppose if a batch norm is immediately followed by a new convolution layer, you could fold the parameters the other way around. Anyway, your deep learning library will already make these kinds of optimizations for you.)
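For the curious, here is what the folding could look like in code. This is a minimal numpy sketch that assumes the convolution weights are laid out as (Cout, Cin, K, K) and the batch norm parameters are one value per output channel; adapt it to however your framework stores them:

import numpy as np

def fold_batch_norm(weights, bias, gamma, beta, mean, variance, epsilon=1e-3):
    # weights has shape (Cout, Cin, K, K); all other arguments have shape (Cout,)
    scale = gamma / np.sqrt(variance + epsilon)
    w_new = weights * scale[:, None, None, None]   # scale each filter's weights
    b_new = (bias - mean) * scale + beta           # fold mean and beta into the bias
    return w_new, b_new

If the original layer has no bias, pass in a vector of zeros for bias and you end up with the simpler formula for b_new given above.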

Other layer types

We’ve looked at convolution layers and fully-connected layers, arguably the most important components of modern neural networks. But there are other types of layers too, such as pooling layers.

These other layer types certainly take up time but they don’t use dot products and so MACCs are not a good measurement. If you’re interested in counting the FLOPS, just take the feature map size and multiply it by some constant that represents the difficulty of handling a single input element.

Example: a max pooling layer with filter size 2 and stride 2 on a 112×112 feature map with 128 channels uses 112 × 112 × 128 = 1,605,632 FLOPS or 1.6 mega FLOPS. Of course, if the stride is different from the filter size (such as a 3×3 window with 2×2 stride) then these numbers will change a little.

However, often these additional layers will simply be ignored when determining the complexity of the network. After all, 1.6 MFLOPS is pretty small compared to the 100s of MFLOPS for the convolution / fully-connected layers. It becomes a rounding error on the total computational complexity of the network.

Some kinds of operations, such as concatenation of results, can often even be done for free. Instead of two layers writing into their own output tensor and then having a concat layer that copies these two into one big tensor, the first layer could directly write into the first half of the big tensor, and the second layer into the second half. No separate copying step is necessary.

Note: In this discussion I’m currently ignoring recurrent neural networks (RNNs), which often consist of LSTM or GRU layers. In a previous post I explained the math for an LSTM layer. It involves doing two big matrix multiplies, a few sigmoids, a tanh, and some element-wise multiplies. Essentially it’s the same as 2 fully-connected layers, and so the number of MACCs primarily depends on the size of the input and output vectors, as well as the size of the hidden state vectors. Again, it’s the dot products from the matrix multiplies that count the most.

Memory

The number of computations — whether you count them as MACCs or FLOPS — is only part of the story. Memory bandwidth is the other part, and most of the time it is even more important!

On current computer architectures, a single memory access from main memory is much slower than a single computation — by a factor of about 100 or more!

You just saw that these neural networks do a lot of computations, but how many memory accesses do they perform?

For each layer, the device needs to:

  1. read the input vector or feature map from main memory
  2. compute the dot products — which involves reading the layer’s weights from main memory too
  3. write the result back to main memory as a new vector or feature map.

This involves a lot of memory accesses. Since memory is pretty slow, the amount of memory read/writes performed by the layer will have a big impact on its speed too — bigger perhaps than the number of computations.

Memory for weights

Layers store their learned parameters, or weights, in main memory. In general, the fewer weights the model has, the faster it runs.

As we’ve discussed, a fully-connected layer keeps its weights in a matrix of size I × J, where I is the number of input neurons and J the number of outputs. It also has a bias vector of size J. So in total there are (I + 1) × J weights in this layer.

Most convolutional layers used today have square kernels, so for a convolutional layer with kernel size K and Cin input channels, there are K × K × Cin weights per filter. The layer will have Cout filters / output channels, so the total number of weights is K × K × Cin × Cout plus an additional Cout bias values.

In general, convolutional layers have far fewer weights than fully-connected layers.

Example: a fully-connected layer with 4096 inputs and 4096 outputs has (4096+1) × 4096 = 16.8M weights. A convolutional layer with a 3×3 kernel and 48 filters that works on a 64 × 64 input image with 32 channels, has 3 × 3 × 32 × 48 + 48 = 13,872 weights.

Note that the input to the convolutional layer in this example is actually 32 times larger than the fully-connected layer’s input, and the output is 48 times larger. So the conv layer works on more data but has 1000× fewer weights. It’s obvious that fully-connected layers are memory hogs!
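Here are those two weight counts as a sketch:

def fc_weights(num_inputs, num_outputs):
    # I x J weights plus J bias values
    return (num_inputs + 1) * num_outputs

def conv_weights(kernel_size, c_in, c_out):
    # K x K x Cin weights per filter, Cout filters, plus Cout bias values
    return kernel_size * kernel_size * c_in * c_out + c_out

print(fc_weights(4096, 4096))   # 16,781,312, about 16.8M
print(conv_weights(3, 32, 48))  # 13,872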

Note: Fully-connected and convolutional layers are actually very similar. You can implement the one using the other, and vice versa. A convolutional layer is basically a fully-connected layer with the vast majority of the connections set to 0 — each output is only connected to K × K inputs rather than all of them, and all the outputs use the same values for these connections. This is why convolutional layers are so much more efficient about memory, since they don’t store the weights for connections that are not used.

Lower precision weights

The size of the individual weights also matters. Desktop-class machines use 32-bit floats, which take up 4 bytes each. On iOS, it’s more common to use 16-bit floats (“half precision”), which only take up 2 bytes each. They have much less precision, but on the upside they are faster, especially because iPhone and iPad GPUs only have 16-bit ALUs. But it’s possible to go even lower than that, using 8-bit weights or even 1-bit weights.

It’s also important to make a distinction between the storage format of the weights versus the format that is used for computation. If you’re storing weights as 8-bit quantized values, the GPU kernel will first convert them back to floats and then do the computation using floating-point values anyway. (Although some toolkits have convolution layers that can work directly with quantized numbers.)

The precision of the accumulation that happens during the calculation of the dot product is also important. Even with 16-bit floats, it makes sense to perform the dot product with 32-bit floats, and then convert the result back to 16-bits. This way you don’t lose any precision while adding up the numbers. But it’s also slower than doing the accumulation with 16-bit numbers.

Reading memory is slow, and therefore a layer with fewer weights will be faster than the same kind of layer that has more weights. Not only because it has fewer MADDs, but also because it has to access main memory less often to read the weights.

Of two layers with the same number of weights, where one uses float32 and the other float16, the layer with the smaller weights will be faster, but at the cost of some accuracy.

In practice, 16-bit floats are good enough for convolutional neural networks. You lose a fair bit of precision but on average these precision errors cancel out and the model will still give the right results.

(Recently I ran into a problem where the variance of one batch norm layer was extremely large. When folded into the convolution layer’s weights, this made the weights smaller than the precision available with 16-bit floats, effectively setting them to zero. No surprise that this neural network did not work very well with 16-bit floats. That was a fun one to debug. 🤓)

Feature maps and intermediate results

In the literature you’ll often see the complexity of a model listed as the number of MACCs (or FLOPS) and the number of trained parameters. However, this leaves out an important metric: the amount of memory that’s being read for the layer’s inputs, and the number of memory accesses performed for writing the layer’s outputs.

I’m going to assume here that reading a single input value counts as “one memory access”, and writing a single output value also counts as one memory access. This isn’t necessarily true in practice: Metal on iOS reads and writes memory values 4 at a time (due to the way the data is stored into textures).

But that shouldn’t affect the calculations in this section. I just want to come up with some number that describes the amount of memory accesses for a given model, so that we can get a sense of how many memory accesses one model makes versus another.

Again, these numbers will just be approximations, since we don’t know exactly how the GPU kernels work anyway.

Note: Like CPUs, GPUs can also do caching to speed up memory reads and writes. However, I’m not sure if Apple’s GPUs do caching, and if they do it’s not going to be major amounts (kilobytes rather than MBs). More important is proper memory coalescing, which means that GPU threads read memory that is in the same neighborhood. That way the GPU can read a chunk of memory in one go instead of doing separate reads for each thread. GPU kernels can also read small amounts of memory into local or “threadgroup” storage for faster access. In this section I’m assuming the GPU kernels were optimized for reading and writing memory as efficiently as possible, and so the numbers presented here are a theoretical upper bound, not exact figures.

Let’s say the input shape for a convolutional layer is 224×224×3, a typical size for an image classifier. That is 150,528 memory accesses just to read all the input values once. However, if the convolution kernel is 3×3 then we actually need to read each input value 9 times for every output value!

And because we don’t have just one convolution filter in the layer but Cout of them, we’ll actually be reading each input pixel K × K × Cout times.

(A smart GPU kernel programmer will have ways to optimize this. It’s possible for each GPU thread to compute multiple output pixels instead of just one, allowing it to re-use some of the input values several times, requiring fewer memory reads overall. However, all these optimizations will be applied to all models equally. So even if my formulas are not 100% correct, they will only be wrong by a constant factor, and are therefore still useful for comparing models.)

If this particular convolutional layer has stride 2 and 32 filters, it writes an output feature map of 112×112×32 values. That is another 401,408 memory accesses.

In general, the total number of memory accesses made by a convolutional layer is:

input = Hin × Win × Cin × K × K × Cout
output = Hout × Wout × Cout
weights = K × K × Cin × Cout + Cout

For the example layer that is:

input = 224 × 224 × 3 × 3 × 3 × 32 = 43,352,064
output = 112 × 112 × 32 = 401,408
weights = 3 × 3 × 3 × 32 + 32 = 896
total = 43,754,368

Note that the weights here are negligible. By the way, I’m assuming that the weights will be read once and cached in local GPU memory, so they can be shared between GPU threads, and will be re-used for every output pixel.
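Here is a sketch that wraps those formulas into a helper and reproduces the example layer’s total:

def conv_memory_accesses(kernel_size, c_in, h_in, w_in, c_out, h_out, w_out):
    # A theoretical upper bound: every input value is read K x K x Cout times,
    # every output value is written once, and the weights are read once.
    inputs  = h_in * w_in * c_in * kernel_size * kernel_size * c_out
    outputs = h_out * w_out * c_out
    weights = kernel_size * kernel_size * c_in * c_out + c_out
    return inputs + outputs + weights

print(conv_memory_accesses(3, 3, 224, 224, 32, 112, 112))   # 43,754,368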

For a layer deeper in the network, with 256 input and 512 output channels, the numbers may be more like this:

input = 28 × 28 × 256 × 3 × 3 × 512 = 924,844,032
output = 28 × 28 × 512 = 401,408
weights = 3 × 3 × 256 × 512 + 512 = 1,180,160
total = 926,425,600

Even though the feature maps are now smaller in width and height, they will have more channels. That’s why the weights start to count more as well, as there will be more and more of them due to the increased number of channels.

This is why it’s a good idea to use depthwise-separable layers. Let’s take the same input and output sizes but compute the number of memory accesses for a 3×3 depthwise convolution layer followed by a 1×1 pointwise layer:

depthwise layer
input = 28 × 28 × 256 × 3 × 3 = 1,806,336
output = 28 × 28 × 256 = 200,704
weights = 3 × 3 × 256 + 256 = 2,560
total = 2,009,600

pointwise layer
input = 28 × 28 × 256 × 1 × 1 × 512 = 102,760,448
output = 28 × 28 × 512 = 401,408
weights = 1 × 1 × 256 × 512 + 512 = 131,584
total = 103,293,440

total of both layers = 105,303,040

You’ve seen that this does about 8.4 times fewer computations, and it also accesses way less memory, by a factor of about 8.8 (again almost K × K). The depthwise layer is so cheap it almost doesn’t count.

Fusion

In training packages like Keras, you’ll often see that a Conv2D layer is followed by an Activation layer that applies the ReLU. That’s fine for Keras, but it’s wasteful to make ReLU a separate layer, especially since the function is so simple.

Example: apply ReLU on the 28 × 28 × 512 output from a convolution layer:

input = 28 × 28 × 512 = 401,408
output = 28 × 28 × 512 = 401,408
total = 802,816

A separate ReLU layer first needs to read the output values from the convolution layer, apply the ReLU, and write the results back to main memory. Granted, this is pretty fast as it’s almost the same as copying the data from one memory location to another, but it is also a bit of wasted effort.

For this reason, activation functions are typically fused with the convolution layer. This means that the convolution layer directly applies the ReLU after it computed the dot products, before it writes out the final result.

This saves the expensive step of reading and writing memory twice.

This fusion thing also explains why with Metal you cannot supply your own custom activation functions. You cannot modify Apple’s shader source code to call your function, so you have no choice but to write a separate compute kernel and do these extra memory reads and writes. It’s not a massive problem, but nice if you can avoid it (so stick to the built-in non-linearities).

MobileNet V2 versus V1

In the introduction I mentioned that MobileNet V2 with depth multiplier 1.4 ran about as fast as V1, even though it has fewer parameters.

I should be a little bit more specific about the actual use case. My client was using MobileNet as a feature extractor for a larger model and did not use all the layers. V1 was used up to layer conv_pw_11, V2 up to layer expanded_conv_12.

In case you’re not intimately familiar with these architectures, that is 23 layers for the version with MobileNet V1 and 47 layers for the version with V2. The fact that V2 has many more layers should already make you suspicious!

The input images were not square but 126 × 224, so they have the same aspect ratio as frames from the camera, which are 720 × 1280.

In that configuration:

MobileNet V1 parameters (multiplier = 1.0): 1.6M
MobileNet V2 parameters (multiplier = 1.0): 0.5M
MobileNet V2 parameters (multiplier = 1.4): 1.0M

I’m also including V2 with multiplier 1.0 in these results, just to show the influence of the depth multiplier. Note that V2 with multiplier 1.4 has more than 1.4 times the number of parameters of the multiplier 1.0 version, in fact about 2 times more. The parameter count doesn’t scale linearly with the depth multiplier.

Also, if we compare the number of MACCs between the two models, V2 also does better on paper:

MobileNet V1 MACCs (multiplier = 1.0): 255M
MobileNet V2 MACCs (multiplier = 1.0): 111M
MobileNet V2 MACCs (multiplier = 1.4): 214M

Recall that the MACCs are dependent on the size of the input image. You will get different results here for 224 × 224 images. Still, V2 does less work than V1, even with a large depth multiplier.

Since the larger V2 model turned out not to be any faster than V1 in practice, what we should see when we look at the number of memory accesses in each model is that MobileNet V2 does more of them than V1.

The explanation for this is that V1 uses depthwise-separable convolutions, which consist of two layers, while V2 uses expansion blocks that contain three layers (and also has a few element-wise addition layers for dealing with residual connections).

What this means is that, due to having many more layers, V2 spends more time reading and writing intermediate results to/from main memory. Any gains it has in fewer MACCs and parameters are undone by these extra memory accesses.

Here are the numbers I computed for the network architectures used:

MobileNet V1 memory accesses (multiplier = 1.0): 283M
MobileNet V2 memory accesses (multiplier = 1.0): 159M
MobileNet V2 memory accesses (multiplier = 1.4): 286M

As you can see, MobileNet V2 with a 1.4 multiplier accesses more memory than V1. So even though the new model uses fewer parameters, and performs fewer computations, it still runs at about the same speed as V1 (slightly slower, in fact).

This provides some proof for my hypothesis that the amount of memory accesses is the primary factor for determining the speed of the neural net.

Reducing the number of computations is important too, of course, but if the choice is between doing more work versus accessing more memory, then going for more work is preferable.

Note: The above analysis is only valid when using V2 with a depth multiplier of 1.4. When using the default value of 1.0, it definitely is faster than V1. Also, the advantage of using a larger depth multiplier is that the model becomes more accurate. So the whole thing is not a loss: with V2 and depth multiplier = 1.4 you get a model that still runs pretty fast but is a whole lot more powerful. I mentioned that it has 47 layers, more than double the number in the V1 feature extractor used by my client. Those extra layers give the model a lot more capacity to learn things.

The above results got me curious about VGG16, which is also often used as a feature extractor. It uses regular convolutional layers, and it has fewer of them, so in theory this does more work and accesses less memory. According to what I just claimed, this “ought to” perform better than a model with fewer computations but more memory accesses. But does it?

Here are the results for VGG16 on an input image of size 126×224 (without the fully-connected layers at the end).

VGG16 parameters:        15M
VGG16 MACCs:           8380M
VGG16 memory accesses: 8402M

Welp, that’s no good. It turns out that, even though VGG has fewer layers, it works on much larger feature maps, and so it accesses a ton of memory anyway.

I hope this shows that all these things — number of computations, number of parameters, and number of memory accesses — are deeply related. A model that works well on mobile needs to carefully balance those factors.

Note: And remember, a lot of these numbers are approximate and are only useful when you’re comparing the relative performance of two models on similar hardware with a similar software stack. By themselves, or taken out of context, these numbers are pretty meaningless. But it certainly helped to explain why the model with MobileNet V2 layers was not any faster than the older version with V1 layers, even though it looked faster on paper because it has fewer trained parameters and MACCs.

Further reading

Interesting papers:

Here are some tools that you can use to get an estimate of the computational complexity of your models:

Unfortunately, these tools don’t include measurements for the number of memory accesses yet.

Written by Matthijs Hollemans.
First published on Saturday, 30 June 2018.
If you liked this post, say hi on Twitter @mhollemans or LinkedIn.
Find the source code on my GitHub.
