Clearing TensorFlow GPU memory after model execution
Solution 1
A GitHub issue from June 2016 (https://github.com/tensorflow/tensorflow/issues/1727) describes the following problem:
currently the Allocator in the GPUDevice belongs to the ProcessState, which is essentially a global singleton. The first session using GPU initializes it, and frees itself when the process shuts down.
Thus the only workaround would be to use processes and shut them down after the computation.
Example Code:
import tensorflow as tf
import multiprocessing
import numpy as np

def run_tensorflow():
    n_input = 10000
    n_classes = 1000

    # Create model
    def multilayer_perceptron(x, weight):
        # Single linear layer (matmul only, no activation)
        layer_1 = tf.matmul(x, weight)
        return layer_1

    # Store layer weights
    weights = tf.Variable(tf.random_normal([n_input, n_classes]))

    x = tf.placeholder("float", [None, n_input])
    y = tf.placeholder("float", [None, n_classes])
    pred = multilayer_perceptron(x, weights)

    cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=pred, labels=y))
    optimizer = tf.train.AdamOptimizer(learning_rate=0.001).minimize(cost)

    init = tf.global_variables_initializer()

    with tf.Session() as sess:
        sess.run(init)
        for i in range(100):
            batch_x = np.random.rand(10, 10000)
            batch_y = np.random.rand(10, 1000)
            sess.run([optimizer, cost], feed_dict={x: batch_x, y: batch_y})

    print("finished doing stuff with tensorflow!")

if __name__ == "__main__":
    # option 1: execute code with extra process
    p = multiprocessing.Process(target=run_tensorflow)
    p.start()
    p.join()

    # wait until user presses enter key
    input()

    # option 2: just execute the function
    run_tensorflow()

    # wait until user presses enter key
    input()
So if you call the function run_tensorflow() within a process you created and shut the process down (option 1), the memory is freed. If you just run run_tensorflow() (option 2), the memory is not freed after the function call.
Solution 2
You can use the numba library to release all the GPU memory:
pip install numba
from numba import cuda
device = cuda.get_current_device()
device.reset()
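This releases all the GPU memory held by the process on that device. As the comments below report, TensorFlow calls made afterwards in the same process may fail.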
Solution 3
I use numba to release the GPU memory. I could not find an effective way to do it with TensorFlow alone.
import tensorflow as tf
from numba import cuda

a = tf.constant([1.0, 2.0, 3.0], shape=[3], name='a')
b = tf.constant([1.0, 2.0, 3.0], shape=[3], name='b')
with tf.device('/gpu:1'):
    c = a + b

TF_CONFIG = tf.ConfigProto(
    gpu_options=tf.GPUOptions(per_process_gpu_memory_fraction=0.1),
    allow_soft_placement=True)

sess = tf.Session(config=TF_CONFIG)
sess.run(tf.global_variables_initializer())
i = 1
while i < 1000:
    i = i + 1
    print(sess.run(c))

sess.close()  # without numba, the GPU memory is not released
cuda.select_device(1)
cuda.close()

with tf.device('/gpu:1'):
    c = a + b

TF_CONFIG = tf.ConfigProto(
    gpu_options=tf.GPUOptions(per_process_gpu_memory_fraction=0.5),
    allow_soft_placement=True)

sess = tf.Session(config=TF_CONFIG)
sess.run(tf.global_variables_initializer())
while True:
    print(sess.run(c))
Solution 4
There currently seem to be two ways to handle iterative model training, or the case where you use a futures-based multiprocessing pool to serve model training, in which the pool's processes are not killed when a future finishes. You can apply either method to release GPU memory from the training process while preserving the main process.
- Call a subprocess to run the model training. When one training phase completes, the subprocess exits and frees the memory. It's easy to get the return value.
- Use multiprocessing.Process to run the model training: p.start() launches it, and p.join() waits for the process to exit and free the memory.
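A minimal sketch of the first approach (a subprocess), assuming the training code can print its result as JSON on its last line of stdout. The child code string and the name run_training_subprocess are hypothetical stand-ins, not part of the original answer; a real child would build and train the TensorFlow model before printing.

```python
import json
import subprocess
import sys

# Hypothetical stand-in for a training script. A real one would build and
# train the TensorFlow model, then print its result as JSON.
CHILD_CODE = "import json; print(json.dumps({'loss': 0.25}))"


def run_training_subprocess(code=CHILD_CODE):
    """Run `code` in a fresh Python interpreter and return its JSON result."""
    completed = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the child exits with an error
    )
    # Parse only the last line, in case the child printed other output first.
    return json.loads(completed.stdout.strip().splitlines()[-1])


result = run_training_subprocess()
print(result)
```

Because the child interpreter exits when subprocess.run returns, any GPU memory it allocated is released along with the process, while the parent stays alive.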
Here is a helper function using multiprocessing.Process that opens a new process to run a Python function and returns its value, instead of using subprocess:
import logging
import os
import sys
import traceback
from multiprocessing import Process, Queue  # multiprocessing.Queue, not queue.Queue

logger = logging.getLogger(__name__)

# open a new process to run function
def process_run(func, *args):
    # wrapper_func is nested, so it cannot be pickled:
    # this relies on the 'fork' start method
    def wrapper_func(queue, *args):
        try:
            logger.info('run with process id: {}'.format(os.getpid()))
            result = func(*args)
            error = None
        except Exception:
            result = None
            ex_type, ex_value, tb = sys.exc_info()
            error = ex_type, ex_value, ''.join(traceback.format_tb(tb))
        queue.put((result, error))

    def process(*args):
        queue = Queue()
        p = Process(target=wrapper_func, args=[queue] + list(args))
        p.start()
        result, error = queue.get()
        p.join()
        return result, error

    result, error = process(*args)
    return result, error
Solution 5
I am trying to figure out which option is better in Jupyter Notebook. Jupyter Notebook keeps the GPU memory occupied permanently, even after a deep learning application has completed. This usually triggers a GPU Fan Error, which is a big headache. When that happens, I have to reset nvidia_uvm and reboot the Linux system regularly. I have found that the following two options both remove the GPU Fan Error headache, but I want to know which is better.
Environment:
- CUDA 11.0
- cuDNN 8.0.1
- TensorFlow 2.2
- Keras 2.4.3
- Jupyter Notebook 6.0.3
- Miniconda 4.8.3
- Ubuntu 18.04 LTS
First Option
Put the following code at the end of the cell. The kernel ends immediately once the application has finished running, but it is not very elegant: Jupyter pops up a message about the dead kernel.
import os
pid = os.getpid()
!kill -9 $pid
Second Option
The following code can also end the kernel in Jupyter Notebook. I do not know whether numba is safe. Device "0" is the GPU most commonly used by individual developers (as opposed to server users). However, both Neil G and mradul dubey have responded: This leaves the GPU in a bad state.
from numba import cuda
cuda.select_device(0)
cuda.close()
It seems that the second option is more elegant. Can someone confirm which is the best choice?
Notes:
Automatically releasing the GPU memory is not a problem in an Anaconda environment when directly executing "$ python abc.py". However, I sometimes need to use Jupyter Notebook to work with .ipynb applications.
Comments
-
David Parks almost 2 years
I've trained 3 models and am now running code that loads each of the 3 checkpoints in sequence and runs predictions using them. I'm using the GPU.
When the first model is loaded it pre-allocates the entire GPU memory (which I want for working through the first batch of data). But it doesn't unload memory when it's finished. When the second model is loaded, using both tf.reset_default_graph() and with tf.Graph().as_default(), the GPU memory still is fully consumed from the first model, and the second model is then starved of memory.
Is there a way to resolve this, other than using Python subprocesses or multiprocessing to work around the problem (the only solution I've found via Google searches)?
-
Yaroslav Bulatov over 7 years
What if you delete the session (del sess)? That should have the same effect on memory as restarting process
-
etarion over 7 years
Shouldn't sess.close() (or using the Session as a context with with) also work?
-
David Parks over 7 years
I wish, I do use with ... sess: and have also tried sess.close(). GPU memory doesn't get cleared, and clearing the default graph and rebuilding it certainly doesn't appear to work. That is, even if I put 10 sec pause in between models I don't see memory on the GPU clear with nvidia-smi. That doesn't necessarily mean that tensorflow isn't handling things properly behind the scenes and just keeping its allocation of memory constant. But I'm having troubles validating that line of reasoning.
-
Yaroslav Bulatov over 7 years
nvidia-smi doesn't correctly report amount of memory available to TensorFlow. When TensorFlow computation releases memory, it will still show up as reserved to outside tools, but this memory is available to other computations in tensorflow
-
David Parks over 7 years
@YaroslavBulatov I've done more testing and confirmed that tensorflow is performing as expected on the 2nd and 3rd models after simply resetting the default graph. If you post that as an answer I'll accept it as correct. It seems that this question is irrelevant, though probably commonly asked so worth keeping open.
-
Fedor Chervinskii almost 7 years
As for now, tensorflow still doesn't release GPU memory with sess.Close() or after with tf.Session() as sess: , could you please update your answer considering comments above?
-
Diego Aguado almost 7 years
@yaroslav-bulatov you mentioned on your comments that nvidia-smi doesn't show the correct memory on a gpu. I tried tf.reset_default_graph() and then rebuild the previous graph but I have an OOM error which suggests that nvidia-smi is displaying correctly the memory. Any thoughts?
-
Yaroslav Bulatov almost 7 years
@DiegoAgher what I meant that nvidia-smi may show 0 available memory, yet there's still plenty of memory available for TensorFlow to use. The reason is that TensorFlow takes over the memory management
-
Diego Aguado almost 7 years
@yaroslavBulatov so how would you go about freeing up space in the gpu if the tensorflow pool is still there and when building again the graph, I get the OOM error ?
-
Yaroslav Bulatov almost 7 years
It's freed up automatically. OOM error in tensorflow is typically caused by having models that are too large
-
Fosa almost 7 years
I also have the OOM error which seems to be due to variables not being released. For example the model will run and train several times, but after reassigning the variable (not changing the total size), it may give an OOM error. Closing Spyder and reopening has been my only recourse..
-
Ben Usman almost 6 years
I wrote a small reusable wrapper that uses same trick as in this answer. However, performance degradation is severe, which is okay for small computations (i.e. inference on a small dataset), but not practical in any other scenario. I believe this must be due to inter-process communication and passing large numpy objects back and forth.
-
Igor almost 6 years
I'm not sure I get the difference between the two methods you are talking about. They both look like just "use multiprocessing" to me. And there's already a nice and more detailed answer about it.
-
liviaerxin almost 6 years
In my sense, 'multiprocessing' and 'subprocess', they both spawn the new process to handle the GPU run and free but operate in different ways
-
guillefix over 5 years
Running this code gives me an tensorflow.python.framework.errors_impl.InternalError: Failed to create session. preceded by Failed precondition: Failed to memcopy into scratch buffer for device 0, when I try to run the tf.Session() after the cuda.close(1)
-
Austin over 5 years
Can I use this to free GPU memory after loading a Keras model?
-
Neil G almost 5 years
This leaves tensorflow in a bad state
-
mradul dubey about 4 years
This leaves the GPU in a bad state.
-
Scott White almost 4 years
Would you mind explaining what you mean by "leaves the GPU in a bad state"? That doesn't tell us the ramifications of using this approach.
-
mradul dubey almost 4 years
Related and important, multiprocessing.Process uses spawn as default on Windows, but, fork on *nix systems. If you find yourself in a situation where the model running in a separate Process is unable to use GPU i.e. tf.test.is_gpu_available is False while checking cross platform compatibility, you can force select the state using multiprocessing.get_context('spawn'). spawn is available for Windows, Linux and MacOS. More on context here
-
Mike Chen almost 4 years
My test shows that numba is a better choice. However, users need to use pip install numba rather than conda install -c numba numba or sudo apt-get install python-3 numba. conda install... has an internal conflict and sudo apt-get install.. could not be used.
-
Hagbard over 3 years
I guess what he means is that this results in an " .\tensorflow/core/kernels/random_op_gpu.h:232] Non-OK-status: GpuLaunchKernel(FillPhiloxRandomKernelLaunch<Distribution>, num_blocks, block_size, 0, d.stream(), gen, data, size, dist) status: Internal: invalid resource handle" error. At least, that's the case for me.
-
Foivos Ts over 3 years
This is a great answer.
-
Michael Malak over 3 years
Should be noted that is multiprocessing.Queue and not queue.Queue
-
Joe Huang over 3 years
So how to make multiprocessing works for jupyter notebook?
-
Jed over 3 years
I can't get this to work, nor the small reusable wrapper listed by Ben Usman. The problem is that the parallel_wrapper is not picklable if using 'spawn' and the process hangs if using 'fork'. It's hard to believe that after nearly 4 years this is still such an issue with TF. does anybody know of a good resolution?
-
Jed over 3 years
perhaps by bad state, he means that this kills the kernel. this cannot be done in the midst of long processes
-
Hemanth Kollipara over 2 years
This results in Could not synchronize CUDA stream: CUDA_ERROR_INVALID_HANDLE: invalid resource handle error if run any code after performing the cuda.close(), is there any way to clear the CUDA memory without getting this error
-
mastDrinkNimbuPani about 2 years
I used the First Option a few times, and it worked well for me. Thanks. I tested on Tensorflow 2.4.