Jupyter: The kernel appears to have died. It will restart automatically. (Keras Related)
Solution 1
Okay, got it.
The problem was my tensorflow-gpu version (1.14), which was not compatible with my CUDA version (9.0). I had to install a version lower than 1.13. But that's not the only catch: my cuDNN version (705) was also problematic, so I had to downgrade tensorflow-gpu all the way to 1.9.0.
Now everything works.
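To check a pairing like this before installing, here is a minimal sketch of a lookup. The table is a small subset of the "tested build configurations" that TensorFlow publishes for its Linux GPU wheels. The names TESTED_CONFIGS and required_cuda are made up for this illustration, and the table only says which CUDA/cuDNN series a release was built against, not which cuDNN patch level it needs.

```python
# Subset of TensorFlow's published "tested build configurations" (Linux, GPU).
# Each entry: (lowest TF version, highest TF version, CUDA, cuDNN).
# TESTED_CONFIGS and required_cuda are hypothetical names for this sketch.
TESTED_CONFIGS = [
    ((1, 13), (1, 15), "10.0", "7.4"),
    ((1, 5), (1, 12), "9", "7"),
]


def required_cuda(tf_version):
    """Return (cuda, cudnn) for a tensorflow-gpu version string, or None if unknown."""
    major, minor = (int(part) for part in tf_version.split(".")[:2])
    for low, high, cuda, cudnn in TESTED_CONFIGS:
        if low <= (major, minor) <= high:
            return cuda, cudnn
    return None


if __name__ == "__main__":
    for version in ("1.14.0", "1.9.0"):
        print(version, "->", required_cuda(version))
```

With this table, 1.14.0 maps to CUDA 10.0 and 1.9.0 maps to CUDA 9, which matches what happened above.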
Solution 2
Most likely this is because there is not enough memory to store the data/model. Your input image size is 1024x1024, which is large. I would suggest training with a smaller image size such as 256 or even 128, just to see whether it works at all.
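As a rough illustration of why the image size matters: in a fully convolutional network the feature maps scale with the number of input pixels, so shrinking the side length shrinks activation memory roughly quadratically. The sketch below only does that arithmetic; the helper names are made up, and it assumes float32 inputs with 3 channels. It does not measure the real memory use of the model.

```python
def pixel_ratio(side_a, side_b):
    """How many times more pixels a side_a x side_a image has than a side_b x side_b one."""
    return (side_a * side_a) / (side_b * side_b)


def input_batch_mib(side, batch_size, channels=3, bytes_per_value=4):
    """Size in MiB of one float32 input batch of square images (assumed layout)."""
    return side * side * channels * bytes_per_value * batch_size / 2 ** 20


if __name__ == "__main__":
    print(pixel_ratio(1024, 256))   # 16.0
    print(pixel_ratio(1024, 128))   # 64.0
    print(input_batch_mib(1024, 1))  # 12.0
```

So dropping from 1024 to 256 cuts the per-image activation footprint by about a factor of 16, which is usually enough to tell a memory problem apart from a setup problem.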
Also, is your GPU being detected by TF?
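One way to check this from Python is to ask TensorFlow for its local devices. This is a minimal sketch: list_gpu_devices is a made-up helper name, and it returns None when TensorFlow cannot be imported at all. An empty list means TensorFlow loaded but did not register any GPU.

```python
def list_gpu_devices():
    """Return the names of GPU devices TensorFlow can see, or None if TF is not importable."""
    try:
        from tensorflow.python.client import device_lib
    except ImportError:
        return None
    devices = device_lib.list_local_devices()
    return [d.name for d in devices if d.device_type == "GPU"]


if __name__ == "__main__":
    gpus = list_gpu_devices()
    if gpus is None:
        print("TensorFlow is not installed in this environment")
    elif not gpus:
        print("TensorFlow is installed but sees no GPU")
    else:
        print("GPUs visible to TensorFlow:", gpus)
```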
Schütze
Updated on June 27, 2022
Comments
Schütze almost 2 years
I'm trying to train a ResNet50, but it fails no matter what I do: the Jupyter notebook's kernel dies (The kernel appears to have died. It will restart automatically) the moment training starts (Epoch 1/100). I have a GeForce GTX 1060 Ti, and when I run nvidia-smi during training (which only lasts about 1 second) I see just 80 MB of memory being allocated, far less than in the past, and then the kernel dies, as if it tries but fails.
Here are the requirements:
pandas==0.25.1
numpy==1.17.2
opencv-python==4.1.1.26
scikit-image==0.15.0
scikit-learn==0.21.3
tensorflow-gpu==1.14.0
Keras==2.2.5
matplotlib==3.1.1
Pillow==6.1.0
albumentations==0.3.2
tqdm==4.35.0
jupyter
which I satisfy. Here is how I set up the training session:
config = tf.ConfigProto()
config.gpu_options.allow_growth = False
config.gpu_options.per_process_gpu_memory_fraction = 0.9
sess = tf.Session(config=config)
keras.backend.set_session(sess)
keras.__version__
os.environ["CUDA_VISIBLE_DEVICES"] = '0'  # yes, this is the ID of my GPU.

# create the FCN model
model_mobilenet = ResNet50(input_shape=(1024, 1024, 3), include_top=False)  # use the Resnet
model_x8_output = Conv2D(128, (1, 1), activation='relu')(model_mobilenet.layers[-95].output)
model_x8_output = UpSampling2D(size=(8, 8))(model_x8_output)
model_x8_output = Conv2D(3, (3, 3), padding='same', activation='sigmoid')(model_x8_output)
MODEL_x8 = Model(inputs=model_mobilenet.input, outputs=model_x8_output)
MODEL_x8.compile(loss='binary_crossentropy', optimizer=Adam(lr=1e-3), metrics=[jaccard_distance])
MODEL_x8.fit_generator(train_generator, steps_per_epoch=300, epochs=100, verbose=1,
                       validation_data=val_generator, validation_steps=10)

Epoch 1/100
1/300 [..............................] - ETA: 1:01:59 - loss: 0.7193 - jaccard_distance: 0.1125
I have tried:
- setting config.gpu_options.allow_growth to True
- setting config.gpu_options.per_process_gpu_memory_fraction to other arbitrary values such as 0.1
- commenting out: #os.environ["CUDA_VISIBLE_DEVICES"] = 0
None of them worked. I appreciate constructive answers.
Thanks in advance.
EDIT: I have now tried running this as a script (not as a notebook), and the moment the TensorFlow session line is reached, the terminal prints the following:
2020-01-28 13:44:55.756819: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcudart.so.10.0'; dlerror: libcudart.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/username/ros_ws/devel/lib:/opt/ros/melodic/lib
2020-01-28 13:44:55.757047: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcublas.so.10.0'; dlerror: libcublas.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/username/ros_ws/devel/lib:/opt/ros/melodic/lib
2020-01-28 13:44:55.757313: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcufft.so.10.0'; dlerror: libcufft.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/username/ros_ws/devel/lib:/opt/ros/melodic/lib
2020-01-28 13:44:55.757526: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcurand.so.10.0'; dlerror: libcurand.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/username/ros_ws/devel/lib:/opt/ros/melodic/lib
2020-01-28 13:44:55.757736: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcusolver.so.10.0'; dlerror: libcusolver.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/username/ros_ws/devel/lib:/opt/ros/melodic/lib
2020-01-28 13:44:55.757940: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcusparse.so.10.0'; dlerror: libcusparse.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/username/ros_ws/devel/lib:/opt/ros/melodic/lib
2020-01-28 13:44:55.808416: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2020-01-28 13:44:55.808444: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1663] Cannot dlopen some GPU libraries. Skipping registering GPU devices...
which is strange because I don't have CUDA 10 but rather 9.0, so these libraries should not even be requested. Is my TensorFlow version wrong?
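The messages in that log come from TensorFlow trying to dlopen each CUDA library by its versioned file name. The same probe can be reproduced without TensorFlow, as a minimal sketch using ctypes from the standard library. The library names are copied from the log above; missing_libraries is a made-up helper name, and the result depends on the dynamic loader's search path (including LD_LIBRARY_PATH) of the process that runs it.

```python
import ctypes

# Library file names taken from the log; a CUDA 10.0 build of TensorFlow asks for these.
CUDA_10_0_LIBS = [
    "libcudart.so.10.0",
    "libcublas.so.10.0",
    "libcufft.so.10.0",
    "libcurand.so.10.0",
    "libcusolver.so.10.0",
    "libcusparse.so.10.0",
    "libcudnn.so.7",
]


def missing_libraries(names):
    """Return the subset of shared-library names that the dynamic loader cannot open."""
    missing = []
    for name in names:
        try:
            ctypes.CDLL(name)
        except OSError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    for name in missing_libraries(CUDA_10_0_LIBS):
        print("cannot open:", name)
```

If the script reports the *.so.10.0 files as missing while the CUDA 9.0 equivalents are installed, the installed tensorflow-gpu wheel was built for a different CUDA version than the one on the machine.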