Squeezing Memory out of Caffe

No, I am not talking about reminiscing over the good old days with a cup of coffee.

Today we are talking about the recent release of a new feature in our modified version of the deep learning toolbox Caffe.


In this release, we have implemented the functionality usually referred to as “memory multiloading”. A Wiki page has been set up to describe how to use this feature.


The new feature can drastically reduce the memory footprint when working with deep neural nets: compared with the official version, memory usage is cut roughly in half in training and by around 95% in testing. So how is this done?

The core idea behind this is very intuitive. Suppose we have a set of consumers requesting resources, but they do not all request every resource at the same time. It would be nice to have each consumer occupy a resource only when it really needs it, and return it as soon as it finishes. In terms of memory usage when training deep neural networks, the resource is the limited amount of GPU memory on board, and the consumers are the layers/operations in a network, which store their intermediate results on the GPU.

For example, when training CNNs with the standard backpropagation algorithm, a major part of the memory consumption comes from storing the intermediate activation values and the gradients w.r.t. those activations, a.k.a. sensitivities. If we consider one forward/backward iteration as a cycle, one important observation is that these values need not be stored for the whole cycle.

In particular, for a single layer in feed-forward evaluation, the input activation values become useless once we obtain the layer's output activations. In backpropagation, the input activations and the output sensitivities become useless once we have computed the gradients of the layer parameters and the input sensitivities. This means that each memory block needs physical memory only while it is being produced and consumed, and can release that memory as soon as it will no longer be used.

This observation makes the core idea described above concrete: we can share some storage resources and save memory! In other toolboxes, like TensorFlow or our much faster and more efficient toolbox, Parrots, this is implemented by dynamically scheduling a pre-allocated memory pool, much as the OS manages physical RAM. But how can we achieve this in Caffe? Caffe has a rather different memory model, in which every memory block is allocated autonomously from the OS/GPU, and it is quite intrusive to hook this behavior up to an underlying memory pool.

Instead, we chose an alternative approach. Assuming the architecture of the network is fixed (in most cases this is true), we can determine the life-cycle of every memory block before we run the first iteration. This leads to a process we call a “dry run”, in which we simulate the computation flow to find which SyncedMem-s (the basic memory unit in Caffe) can safely share their underlying storage. Once we have figured this out, we assign them to a single memory slot. In actual training/testing, memory multiloading is achieved by having these shared SyncedMem-s all write to and read from the same piece of memory. Because the dependencies were already resolved during the dry run, the multiloading is safe from data corruption and saves a great deal of memory.
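The slot assignment after the dry run can be sketched as follows (hypothetical Python names; the real code operates on SyncedMem objects in C++). Given the live interval of each block from the simulated pass, a greedy scan reuses a slot whenever its previous occupant's lifetime has already ended, much like linear-scan register allocation:

```python
# Sketch: map blobs with non-overlapping live intervals onto shared
# memory slots (greedy scan over intervals sorted by start time).

def assign_slots(lifetimes):
    """lifetimes: {blob_name: (first_use, last_use)} from a dry run.
    Returns {blob_name: slot_id}; blobs in one slot never overlap."""
    slots = []       # slots[i] = time at which slot i was last used
    assignment = {}
    for name, (start, end) in sorted(lifetimes.items(), key=lambda kv: kv[1]):
        for i, busy_until in enumerate(slots):
            if busy_until < start:   # slot i is free before this blob starts
                slots[i] = end
                assignment[name] = i
                break
        else:                        # no reusable slot: open a new one
            slots.append(end)
            assignment[name] = len(slots) - 1
    return assignment

# Intervals for the toy net above: 4 blobs fit in 3 slots, because
# d_act1 (live 3..4) can reuse the slot of act1 (live 0..2).
lt = {"act1": (0, 2), "act2": (1, 3), "d_act2": (2, 3), "d_act1": (3, 4)}
print(assign_slots(lt))
```

Strict inequality (`busy_until < start`) matters: at the time step where a layer reads one blob and writes another, both must be resident, so their intervals count as overlapping.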

Finally, I would like to note an interesting problem we encountered when implementing this functionality in Caffe. Historically and practically, Caffe has some layers that do no computation but only share the underlying storage of their input and output blobs (ShareData/ShareDiff). This sometimes breaks the apparent dependencies obtained from the dry-run process, so we first make these layers tell the framework that they are doing so. In certain cases, these layers are stacked to build specific architectures, which makes the problem even trickier by introducing recursive sharing (a series of blobs sharing their activation/sensitivity memory blocks). At the end of the day, we used the well-known “union-find” (disjoint-set) data structure to model this behavior, which keeps the implementation clean and elegant.
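For readers unfamiliar with it, a minimal union-find looks like this (a generic sketch with made-up blob names, not the actual Caffe code). Each time a sharing layer aliases two blobs, we union them; chained sharing layers then collapse transitively into one storage group:

```python
# Sketch: union-find over blob storage. Chains of sharing layers
# (a -> b, b -> c) collapse into a single representative root.

class DisjointSet:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        """Return the storage root of x, compressing the path as we go."""
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        """Record that a and b share the same underlying storage."""
        self.parent[self.find(a)] = self.find(b)

ds = DisjointSet()
ds.union("blob_a", "blob_b")  # one sharing layer aliases a and b
ds.union("blob_b", "blob_c")  # another chains b to c
print(ds.find("blob_a") == ds.find("blob_c"))  # True: one storage group
```

With this structure, the dry-run analysis only needs to reason about the group roots, so recursive sharing no longer breaks the dependency tracking.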

Well, it is a little bit wearying to read only text without figures (mainly due to my laziness). The code is publicly available on our GitHub page; the functionality described can be found in the following file.