Jupyter: The kernel appears to have died. It will restart automatically. (Keras Related)

python tensorflow keras deep-learning jupyter-notebook


Solution 1

Okay, got it.

The problem was that my tensorflow-gpu version (1.14) was not compatible with my CUDA version (9.0), so I had to install a version lower than 1.13. But that was not the only catch: my cuDNN version (7.0.5) was also a problem, so I had to go all the way down to tensorflow-gpu 1.9.0.

Now everything works.

Solution 2

Most likely this is because there is not enough memory to store the data and the model. Your input image size is 1024x1024, which is large. I suggest trying a smaller image size such as 256x256 or even 128x128, just to see whether training works at all.
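
To get a feel for the numbers, here is a rough back-of-the-envelope sketch. It assumes float32 values and that the tensor produced by the UpSampling2D layer in the question has the same height and width as the input and 128 channels (the output of the preceding Conv2D(128, ...) layer, upsampled 8x from a 1/8-resolution feature map). The batch size is not given in the question, so it is left as a parameter. The helper name tensor_mib is hypothetical.

```python
def tensor_mib(height, width, channels, batch_size=1, bytes_per_value=4):
    """Size in MiB of one dense activation tensor (float32 by default)."""
    return height * width * channels * batch_size * bytes_per_value / 2**20


# One upsampled 128-channel feature map per image, under the assumptions above.
for side in (1024, 256, 128):
    print(side, tensor_mib(side, side, 128), "MiB per image")
```

This counts a single tensor only. Training also keeps the other layers' activations and their gradients, so the real footprint is several times larger, but the ratio between image sizes is what matters here.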

Also, is your GPU being detected by TF?
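
One way to check this in TensorFlow 1.x is to list the devices TensorFlow registered. This is a small sketch using device_lib.list_local_devices() from tensorflow.python.client. The helper names gpu_names and local_gpus are made up here, and TensorFlow is imported lazily so the filtering helper can be used on its own.

```python
def gpu_names(devices):
    """Return the names of all entries whose device_type is 'GPU'."""
    return [d.name for d in devices if d.device_type == "GPU"]


def local_gpus():
    """Ask TensorFlow which GPUs it registered (requires TensorFlow to be installed)."""
    # Imported here so that gpu_names() stays usable without TensorFlow.
    from tensorflow.python.client import device_lib

    return gpu_names(device_lib.list_local_devices())
```

Calling local_gpus() in the notebook and getting an empty list means TensorFlow did not register any GPU and is running on the CPU.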


Author by

Schütze

Updated on June 27, 2022

Comments

Schütze almost 2 years

I'm trying to train a ResNet50, but no matter what I do, the Jupyter notebook kernel dies ("The kernel appears to have died. It will restart automatically.") the moment training starts (Epoch 1/100). I have a GeForce GTX 1060 Ti. When I run nvidia-smi during training (which lasts only about a second), I see only 80 MB of GPU memory allocated, unlike in the past, and then the kernel dies, as if it tries and fails.

Here are the requirements:

pandas==0.25.1
numpy==1.17.2
opencv-python==4.1.1.26
scikit-image==0.15.0
scikit-learn==0.21.3
tensorflow-gpu==1.14.0
Keras==2.2.5
matplotlib==3.1.1
Pillow==6.1.0
albumentations==0.3.2
tqdm==4.35.0
jupyter

which I satisfy. Here is how I set up the training session:

import os

import tensorflow as tf
import keras

# TF 1.x session configuration: reserve a fixed 90% of GPU memory up front
config = tf.ConfigProto()
config.gpu_options.allow_growth = False
config.gpu_options.per_process_gpu_memory_fraction = 0.9
sess = tf.Session(config=config)
keras.backend.set_session(sess)

keras.__version__
os.environ["CUDA_VISIBLE_DEVICES"] = '0'  # yes, this is the ID of my GPU.

from keras.applications.resnet50 import ResNet50
from keras.layers import Conv2D, UpSampling2D
from keras.models import Model
from keras.optimizers import Adam

# create the FCN model on top of a ResNet50 backbone
# (jaccard_distance, train_generator and val_generator are not shown here)
model_mobilenet = ResNet50(input_shape=(1024, 1024, 3), include_top=False)  # use the ResNet
model_x8_output = Conv2D(128, (1, 1), activation='relu')(model_mobilenet.layers[-95].output)
model_x8_output = UpSampling2D(size=(8, 8))(model_x8_output)
model_x8_output = Conv2D(3, (3, 3), padding='same', activation='sigmoid')(model_x8_output)
MODEL_x8 = Model(inputs=model_mobilenet.input, outputs=model_x8_output)

MODEL_x8.compile(loss='binary_crossentropy', optimizer=Adam(lr=1e-3), metrics=[jaccard_distance])

MODEL_x8.fit_generator(train_generator, steps_per_epoch=300, epochs=100, verbose=1, validation_data=val_generator, validation_steps=10)

Epoch 1/100
  1/300 [..............................] - ETA: 1:01:59 - loss: 0.7193 - jaccard_distance: 0.1125

I have tried:

setting config.gpu_options.allow_growth to True
setting config.gpu_options.per_process_gpu_memory_fraction to other arbitrary values such as 0.1
commenting out the device selection: #os.environ["CUDA_VISIBLE_DEVICES"] = '0'

None of these worked. I would appreciate constructive answers.

Thanks in advance.

EDIT: I have now tried running this as a script (not as a notebook), and the moment the TensorFlow session line is reached, the terminal prints the following:

2020-01-28 13:44:55.756819: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcudart.so.10.0'; dlerror: libcudart.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/username/ros_ws/devel/lib:/opt/ros/melodic/lib
2020-01-28 13:44:55.757047: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcublas.so.10.0'; dlerror: libcublas.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/username/ros_ws/devel/lib:/opt/ros/melodic/lib
2020-01-28 13:44:55.757313: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcufft.so.10.0'; dlerror: libcufft.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/username/ros_ws/devel/lib:/opt/ros/melodic/lib
2020-01-28 13:44:55.757526: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcurand.so.10.0'; dlerror: libcurand.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/username/ros_ws/devel/lib:/opt/ros/melodic/lib
2020-01-28 13:44:55.757736: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcusolver.so.10.0'; dlerror: libcusolver.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/username/ros_ws/devel/lib:/opt/ros/melodic/lib
2020-01-28 13:44:55.757940: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcusparse.so.10.0'; dlerror: libcusparse.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/username/ros_ws/devel/lib:/opt/ros/melodic/lib
2020-01-28 13:44:55.808416: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2020-01-28 13:44:55.808444: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1663] Cannot dlopen some GPU libraries. Skipping registering GPU devices...

This is strange, because I don't have CUDA 10 but rather 9.0, so these libraries should not even be requested. Is my TensorFlow version wrong?
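
The log above can be read mechanically to see which CUDA libraries the installed TensorFlow binary is looking for. The sketch below assumes the log line format shown in this question. The pattern and the function name missing_libraries are written for this illustration and are not part of TensorFlow.

```python
import re

# Matches the quoted library in lines such as:
#   Could not dlopen library 'libcudart.so.10.0'; dlerror: ...
DLOPEN_FAILURE = re.compile(r"Could not dlopen library '(lib\w+)\.so\.([\d.]+)'")


def missing_libraries(log_text):
    """Return (library, version) pairs for every failed dlopen in the log."""
    return DLOPEN_FAILURE.findall(log_text)


def requested_versions(log_text):
    """Return the set of library versions the binary tried and failed to load."""
    return {version for _, version in missing_libraries(log_text)}
```

For the log in this question, every failed library carries the suffix 10.0, which suggests the installed tensorflow-gpu 1.14 binary expects CUDA 10.0 regardless of which CUDA version is present on the machine. The LD_LIBRARY_PATH printed in the log also lists only ROS directories, so no CUDA library directory was on the search path for that run.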