Please consider optimizing this to make use of GPU shared memory. vDNN looks like a good option here, since it lets training spill over into GPU shared memory backed by system RAM, which is currently not used at all. The training process could then use both memory pools at the same time, allowing larger models or batches and so better overall model precision, at the cost of only a small performance hit.
In our case the dedicated GPU memory is maxed out, while the GPU shared memory (backed by system RAM) still has 24 GB free, just sitting there unused.
I think such an optimization would broaden the use of DL training among users who currently have limited hardware.
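For context, the core vDNN idea is to keep only the activations needed soon in scarce device memory and offload the rest to host RAM during the forward pass, fetching them back for the backward pass. Below is a minimal, framework-free sketch of that eviction/fetch pattern in plain Python; the names (`OffloadPool`, `DEVICE_BUDGET`, the string "activations") are hypothetical placeholders for this illustration, not any real API.

```python
DEVICE_BUDGET = 2  # pretend device memory holds only 2 activation buffers

class OffloadPool:
    """Sketch of vDNN-style activation offloading (hypothetical names)."""

    def __init__(self, budget):
        self.budget = budget
        self.device = {}  # activations resident in "device memory"
        self.host = {}    # activations offloaded to "host (shared) memory"

    def store(self, layer, activation):
        # Forward pass: keep the newest activation on device;
        # evict the oldest ones to host RAM when over budget.
        self.device[layer] = activation
        while len(self.device) > self.budget:
            oldest = next(iter(self.device))
            self.host[oldest] = self.device.pop(oldest)

    def fetch(self, layer):
        # Backward pass: copy the activation back if it was offloaded.
        if layer in self.host:
            self.device[layer] = self.host.pop(layer)
        return self.device[layer]

pool = OffloadPool(DEVICE_BUDGET)
for layer in range(4):            # forward: produce 4 activations
    pool.store(layer, f"act{layer}")
for layer in reversed(range(4)):  # backward: consume in reverse order
    _ = pool.fetch(layer)
```

In practice frameworks expose hooks for exactly this: for example, PyTorch's `torch.autograd.graph.saved_tensors_hooks` can pack saved activations to CPU memory and unpack them back for the backward pass, which is one realistic way to exploit the idle RAM-backed shared memory described above.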