NaN loss when training regression network

Solution 1

Regression with neural networks is hard to get working because the output is unbounded, so you are especially prone to the exploding gradients problem (the likely cause of the nans).

Historically, one key solution to exploding gradients was to reduce the learning rate, but with the advent of per-parameter adaptive learning rate algorithms like Adam, you no longer need to set a learning rate to get good performance. There is very little reason to use SGD with momentum anymore unless you're a neural network fiend and know how to tune the learning schedule.
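For example, here is a minimal sketch of compiling the question's model with Adam instead of SGD in Keras (assuming the same model object as in the question; the exact import path and argument names can differ between Keras versions):

from keras.optimizers import Adam

# Adam adapts the step size per parameter, so the default settings are
# usually a reasonable starting point and no manual schedule is needed.
model.compile(loss='mean_absolute_error', optimizer=Adam())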

Here are some things you could potentially try:

  1. Normalize your outputs by quantile normalization or z-scoring. To be rigorous, compute this transformation on the training data, not on the entire dataset. For example, with quantile normalization, if an example is in the 60th percentile of the training set, it gets a value of 0.6. (You can also shift the quantile-normalized values down by 0.5 so that the 0th percentile is -0.5 and the 100th percentile is +0.5.) A short z-scoring sketch follows this list.

  2. Add regularization, either by increasing the dropout rate or by adding L1 and L2 penalties to the weights. L1 regularization is analogous to feature selection, and since you said that reducing the number of features to 5 gives good performance, L1 may help as well.

  3. If these still don't help, reduce the size of your network. This is not always the best idea since it can harm performance, but in your case you have a large number of first-layer neurons (1024) relative to input features (35) so it may help.

  4. Increase the batch size from 32 to 128. 128 is fairly standard and could potentially increase the stability of the optimization.
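To illustrate point 1, here is a minimal sketch of z-scoring the targets using statistics computed on the training set only (Y_train and Y_test are assumed to be the raw target arrays from the question):

import numpy as np

# Compute the statistics on the training targets only, then apply the same
# transformation to both splits, so no test-set information leaks in.
y_mean = np.mean(Y_train)
y_std = np.std(Y_train)

Y_train_scaled = (Y_train - y_mean) / y_std
Y_test_scaled = (Y_test - y_mean) / y_std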

Solution 2

The answer by 1" is quite good. However, all of the fixes seems to fix the issue indirectly rather than directly. I would recommend using gradient clipping, which will clip any gradients that are above a certain value.

In Keras you can use clipnorm=1 (see https://keras.io/optimizers/) to simply clip all gradients with a norm above 1.
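For example, a minimal sketch using the optimizer from the question (older Keras versions take lr, newer ones take learning_rate):

from keras.optimizers import SGD

# Any gradient whose L2 norm exceeds 1 is rescaled to norm 1 before the update.
sgd = SGD(lr=0.01, nesterov=True, clipnorm=1.)
model.compile(loss='mean_absolute_error', optimizer=sgd)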

Solution 3

I faced the same problem before. I searched and found this question and its answers. All the tricks mentioned above are important for training a deep neural network. I tried them all, but still got NaN.

I also found this issue: https://github.com/fchollet/keras/issues/2134. I quote the author's summary as follows:

I wanted to point this out so that it's archived for others who may experience this problem in the future. I was running into my loss function suddenly returning a nan after it got so far into the training process. I checked the relus, the optimizer, the loss function, my dropout in accordance with the relus, the size of my network and the shape of the network. I was still getting loss that eventually turned into a nan and I was getting quite frustrated.

Then it dawned on me. I may have some bad input. It turns out, one of the images that I was handing to my CNN (and doing mean normalization on) was nothing but 0's. I wasn't checking for this case when I subtracted the mean and normalized by the std deviation and thus I ended up with an exemplar matrix which was nothing but nan's. Once I fixed my normalization function, my network now trains perfectly.

I agree with the above viewpoint: the network is sensitive to its input. In my case, I used the log value of a density estimate as input. The absolute values could be very large, which may result in NaN after several gradient steps. I think an input check is necessary. First, you should make sure the input does not include -inf or inf, or numbers that are extremely large in absolute value.
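A quick way to perform such a check with NumPy (a sketch; X stands for whatever array you feed to the model):

import numpy as np

# np.isfinite is False for nan, inf and -inf, so this catches all three at once.
assert np.all(np.isfinite(X)), "input contains nan or inf values"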

Solution 4

I faced the same problem when using an LSTM: my data had some nan values after standardization. Therefore, check the model's input data after standardization; you can see whether you have nan values with:

print(np.any(np.isnan(X_test)))
print(np.any(np.isnan(y_test)))

You can solve this by adding a small value (0.000001) to the std, like this:

import numpy as np

def standardize(train, test):
    # Compute the statistics on the training set only.
    mean = np.mean(train, axis=0)
    # Add a small epsilon so constant (zero-variance) features do not cause a division by zero.
    std = np.std(train, axis=0) + 0.000001

    X_train = (train - mean) / std
    X_test = (test - mean) / std
    return X_train, X_test
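A possible usage, with train and test standing for the raw feature arrays (the names are only illustrative):

X_train, X_test = standardize(train, test)
print(np.any(np.isnan(X_train)), np.any(np.isnan(X_test)))  # both should now print False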

Solution 5

To sum up the different solutions mentioned here and in this GitHub discussion (which ones apply will of course depend on your particular situation):

  • Add regularization, i.e. L1 or L2 penalties on the weights. Otherwise, try a smaller L2 penalty, e.g. l2(0.001), or remove it if it already exists.
  • Try a smaller dropout rate.
  • Clip the gradients to prevent them from exploding. For instance, in Keras you could use clipnorm=1. or clipvalue=1. as parameters for your optimizer.
  • Check the validity of the inputs (no NaNs, or sometimes 0s), e.g. with df.isnull().any().
  • Replace the optimizer with Adam, which is easier to handle. Sometimes replacing SGD with RMSprop also helps.
  • Use RMSprop with heavy regularization to prevent gradient explosion.
  • Try normalizing your data, or inspect your normalization process for any bad values it introduces.
  • Verify that you are using the right activation function (e.g. a softmax instead of a sigmoid for multi-class classification).
  • Try increasing the batch size (e.g. from 32 to 64 or 128) to increase the stability of your optimization.
  • Try decreasing your learning rate.
  • Check the size of your last batch, which may be different from the batch size.

Comments

  • The_Anomaly
    The_Anomaly almost 2 years

    I have a data matrix in "one-hot encoding" (all ones and zeros) with 260,000 rows and 35 columns. I am using Keras to train a simple neural network to predict a continuous variable. The code to make the network is the following:

    model = Sequential()
    model.add(Dense(1024, input_shape=(n_train,)))
    model.add(Activation('relu'))
    model.add(Dropout(0.1))
    
    model.add(Dense(512))
    model.add(Activation('relu'))
    model.add(Dropout(0.1))
    
    model.add(Dense(256))
    model.add(Activation('relu'))
    model.add(Dropout(0.1))
    model.add(Dense(1))
    
    sgd = SGD(lr=0.01, nesterov=True);
    #rms = RMSprop()
    #model.compile(loss='categorical_crossentropy', optimizer=rms, metrics=['accuracy'])
    model.compile(loss='mean_absolute_error', optimizer=sgd)
    model.fit(X_train, Y_train, batch_size=32, nb_epoch=3, verbose=1, validation_data=(X_test,Y_test), callbacks=[EarlyStopping(monitor='val_loss', patience=4)] )
    

    However, during the training process, I see the loss decrease nicely, but during the middle of the second epoch, it goes to nan:

    Train on 260000 samples, validate on 64905 samples
    Epoch 1/3
    260000/260000 [==============================] - 254s - loss: 16.2775 - val_loss:
     13.4925
    Epoch 2/3
     88448/260000 [=========>....................] - ETA: 161s - loss: nan
    

    I tried using RMSProp instead of SGD, tanh instead of relu, and with and without dropout, all to no avail. I tried a smaller model, i.e. with only one hidden layer, and had the same issue (it becomes nan at a different point). However, it does work with fewer features, i.e. if there are only 5 columns, and gives quite good predictions. It seems there is some kind of overflow, but I can't imagine why--the loss is not unreasonably large at all.

    Python version 2.7.11, running on a Linux machine, CPU only. I tested it with the latest version of Theano, and I also get NaNs, so I tried going back to Theano 0.8.2 and have the same problem. The latest version of Keras has the same problem, as does the 0.3.2 version.

    • 1''
      1'' almost 8 years
      Try loss='mean_squared_error', optimizer='adam' with a single hidden layer - still nans?
    • The_Anomaly
      The_Anomaly almost 8 years
      @1'' When using the above model with Adam optimizer, I get nans. With just one layer, it does not give nans during the three epochs of training.
    • pangyuteng
      pangyuteng over 4 years
      For future readers, here is a relevant Keras thread: github.com/keras-team/keras/issues/2134. I had some success by combining all of the suggestions mentioned here, e.g. adding batchnorm, varying the learning rate and optimizer, and adding clip_by_value and clip_by_global_norm. Finally, combing through the code multiple times for bugs also helps, e.g. a missing batch norm layer following one conv layer. :)
    • Krishna vamshi
      Krishna vamshi almost 3 years
      Check for NaN values, it solved my issue... :)
  • 1''
    1'' over 7 years
    Fair point! This is a totally legitimate strategy that's often used with recurrent neural networks, for example. However, before resorting to this it's always good to check that something simple hasn't gone wrong with the optimization.
  • troymyname00
    troymyname00 about 6 years
    I had the same issue as you. While checking my data, I found multiple places with inf data points. Taking those out solved the problem.
  • Aldo Canepa
    Aldo Canepa about 6 years
    This resolved the problem for me, I had multiple NaNs in my embedding matrix :) Thanks.
  • Supamee
    Supamee over 5 years
    How did you remove the nans from the first epoch? I'm having nans before I start training
  • Eran
    Eran over 5 years
    Regarding 1: why not normalize the entire output set? Also, can I use scaling instead?
  • 1''
    1'' over 5 years
    @Eran If you use the entire dataset (train + test) when deciding how to normalize, you're indirectly incorporating information about the test set into the training set, which is a form of train-test contamination. As long as you're only using the training set when deciding how to normalize, though, you can use scaling or any other kind of normalization that gives good performance.
  • pangyuteng
    pangyuteng over 4 years
    I scale the input images (png) from 0-255 (uint8) to 0.-1. (float32); never would I have thought the input was the culprit... Adding a tf.clip_by_value before passing the input to the net for training seems to have resolved my 9-month-long debug journey...
  • HAL9000
    HAL9000 about 4 years
    This should be marked as the correct solution, as it actually fixes the specific problem rather than giving advice on wider topics.
  • allenyllee
    allenyllee almost 4 years
    I checked the batch size and found that it was too small (16); increasing the batch size to 128 works!
  • NeStack
    NeStack almost 4 years
    The same keras link suggests that gradient clipping is no longer supported. Is there an analogous solution?
  • CMCDragonkai
    CMCDragonkai over 3 years
    Does this work for all optimizers? And is it always a good idea to set to 1.0?
  • pir
    pir over 3 years
    Yep, it should work across optimizers. If your optimization problem is sufficiently simple/stable, then this is not needed and may slow training a bit w/o yielding any benefit.
  • Jack Kelly
    Jack Kelly over 3 years
    Also, note that np.isnan(np.inf) == False. To ensure none of your examples contain NaNs or Infs, you can do something like assert np.all(np.isfinite(X)). (This has caught me out several times: I believed my data was fine because I was checking for NaNs. But I'd forgotten that np.isnan doesn't notice Infs!)
  • grofte
    grofte about 3 years
    My rule of thumb with regard to batch size is that it should be as big as memory permits but at most 1% of the number of observations. 1% will give you 100 random batches which means that you still have the stochastic part of stochastic gradient descent.
  • philosofool
    philosofool about 3 years
    In my experience, you can still have exploding gradients if you don't scale your data, whether you have clipnorm working for you or not. Without seeing the input data in the form it takes at the Input(...) step, it's hard to know if this is the solution or not. I'm not saying this is wrong, but I can imagine someone assuming that if they're getting nan losses, this will solve their problem, when it may not.
  • momo
    momo almost 3 years
    @pangyuteng could you give some detail as to what was causing the error in your case? If the input is always scaled to 0-1 by /255, I don't see how that would cause NaNs...
  • pangyuteng
    pangyuteng almost 3 years
    @momo agreed. Looking at my comment from 2019, the 2021 me disagrees with that statement. I think the gist of it is to not trust the input, and to always check for nan and out-of-range values before feeding the values to your model. That Keras issue #2134 is a good read if you are getting nan issues. Good luck!
  • Hemanth Kollipara
    Hemanth Kollipara almost 3 years
    True, there shouldn't be any NaN values in the data we feed to the NeuralNet.
  • StupidWolf
    StupidWolf almost 3 years
    More of a comment
  • Shahar
    Shahar over 2 years
    Exactly the issue I was facing: sometimes, we just miss the obvious. Amazing what a simple dropna() could achieve.
  • Admin
    Admin over 2 years
    As it’s currently written, your answer is unclear. Please edit to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers in the help center.
  • JeeyCi
    JeeyCi almost 2 years
    Be careful: with too large a batch_size you can get stuck in a local minimum.
  • JeeyCi
    JeeyCi almost 2 years
    As far as I know, when using the 'adam' optimizer you do not need to give lr manually as a parameter.