What is the intuition of using tanh in LSTM?

Solution 1

The sigmoid, specifically, is used as the gating function for the three gates (input, output, and forget) in the LSTM because it outputs a value between 0 and 1, so it can let either no flow or complete flow of information through a gate.
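
To make the gating idea concrete, here is a tiny NumPy sketch (my own illustration, not part of the original answer; the vectors are made-up placeholders) showing how a sigmoid gate scales a candidate vector element-wise between "no flow" and "complete flow":

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

gate_preactivation = np.array([-6.0, 0.0, 6.0])  # hypothetical gate pre-activations
candidate = np.array([2.0, 2.0, 2.0])            # hypothetical information to pass through

gate = sigmoid(gate_preactivation)               # ~[0.002, 0.5, 0.998]
print(gate * candidate)                          # ~[0.005, 1.0, 1.995]: blocked, halved, passed
```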

On the other hand, to overcome the vanishing gradient problem, we need a function whose second derivative can sustain over a long range before going to zero. Tanh is a good function with this property.

A good neuron unit should be bounded, easily differentiable, monotonic (good for convex optimization), and easy to handle. If you consider these qualities, then I believe you can use ReLU in place of the tanh function, since they are very good alternatives to each other.

But before making a choice of activation function, you must know what the advantages and disadvantages of your choice over others are. Below I briefly describe some of the activation functions and their advantages; a short code sketch of them follows.

Sigmoid

Mathematical expression: sigmoid(z) = 1 / (1 + exp(-z))

First-order derivative: sigmoid'(z) = exp(-z) / (1 + exp(-z))^2 = sigmoid(z) * (1 - sigmoid(z))

Advantages:

(1) The sigmoid function has all the fundamental properties of a good activation function.

Tanh

Mathematical expression: tanh(z) = [exp(z) - exp(-z)] / [exp(z) + exp(-z)]

First-order derivative: tanh'(z) = 1 - ([exp(z) - exp(-z)] / [exp(z) + exp(-z)])^2 = 1 - tanh^2(z)

Advantages:

(1) Often found to converge faster in practice
(2) Gradient computation is less expensive

Hard Tanh

Mathematical expression: hardtanh(z) = -1 if z < -1; z if -1 <= z <= 1; 1 if z > 1

First-order derivative: hardtanh'(z) = 1 if -1 <= z <= 1; 0 otherwise

Advantages:

(1) Computationally cheaper than Tanh
(2) Saturates for magnitudes of z greater than 1

ReLU

Mathematical expression: relu(z) = max(z, 0)

First-order derivative: relu'(z) = 1 if z > 0; 0 otherwise

Advantages:

(1) Does not saturate even for large values of z
(2) Found much success in computer vision applications

Leaky ReLU

Mathematical expression: leaky(z) = max(z, k·z), where 0 < k < 1

First-order derivative: leaky'(z) = 1 if z > 0; k otherwise

Advantages:

(1) Allows propagation of error for non-positive z, which ReLU doesn't

This paper explains some fun activation functions; you may consider reading it.
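
For quick reference, here is a NumPy sketch of the functions and first derivatives listed above (my own addition, not part of the original answer; the leaky slope k = 0.01 is just a common default, and in practice you would call np.tanh directly for numerical stability):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):
    return np.exp(-z) / (1.0 + np.exp(-z)) ** 2   # = sigmoid(z) * (1 - sigmoid(z))

def tanh(z):
    return (np.exp(z) - np.exp(-z)) / (np.exp(z) + np.exp(-z))

def d_tanh(z):
    return 1.0 - np.tanh(z) ** 2

def hardtanh(z):
    return np.clip(z, -1.0, 1.0)

def d_hardtanh(z):
    return np.where((z >= -1.0) & (z <= 1.0), 1.0, 0.0)

def relu(z):
    return np.maximum(z, 0.0)

def d_relu(z):
    return np.where(z > 0.0, 1.0, 0.0)

def leaky_relu(z, k=0.01):
    return np.maximum(z, k * z)

def d_leaky_relu(z, k=0.01):
    return np.where(z > 0.0, 1.0, k)
```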

Solution 2

LSTMs manage an internal state vector whose values should be able to increase or decrease when we add the output of some function. Sigmoid output is always non-negative; values in the state would only increase. The output from tanh can be positive or negative, allowing for increases and decreases in the state.

That's why tanh is used to determine candidate values to get added to the internal state. The GRU cousin of the LSTM doesn't have a second tanh, so in a sense the second one is not necessary. Check out the diagrams and explanations in Chris Olah's Understanding LSTM Networks for more.

The related question, "Why are sigmoids used in LSTMs where they are?" is also answered based on the possible outputs of the function: "gating" is achieved by multiplying by a number between zero and one, and that's what sigmoids output.
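
To make the placement of the two nonlinearities concrete, here is a minimal single-step LSTM cell sketch following the structure in Chris Olah's post (my own illustration, not code from either answer; the weight shapes and random inputs are placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM time step; W maps [h_prev; x] to the four stacked pre-activations."""
    z = W @ np.concatenate([h_prev, x]) + b
    f, i, o, g = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)  # gates in (0, 1): how much to let through
    g = np.tanh(g)              # candidate values in (-1, 1): can push the state up or down
    c = f * c_prev + i * g      # additive state update
    h = o * np.tanh(c)          # second tanh re-squashes the state before it is exposed
    return h, c

hidden_size, input_size = 4, 3
W = rng.normal(size=(4 * hidden_size, hidden_size + input_size))
b = np.zeros(4 * hidden_size)
h, c = lstm_step(rng.normal(size=input_size), np.zeros(hidden_size), np.zeros(hidden_size), W, b)
print(h)  # every component lies in (-1, 1) because of the final tanh
```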

There aren't really meaningful differences between the derivatives of sigmoid and tanh; tanh is just a rescaled and shifted sigmoid: see Richard Socher's Neural Tips and Tricks. If second derivatives are relevant, I'd like to know how.
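
The rescaling referred to here is tanh(z) = 2·sigmoid(2z) − 1, which is easy to check numerically (a quick sketch of my own, not from the original answer):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5.0, 5.0, 11)
print(np.allclose(np.tanh(z), 2.0 * sigmoid(2.0 * z) - 1.0))  # True
```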

Author: DNK

Updated on July 09, 2022

Comments

  • DNK
    DNK almost 2 years

    In an LSTM network (Understanding LSTMs), why does the input gate and output gate use tanh?

    What is the intuition behind this?

    Is it just a nonlinear transformation? If so, can I change both to another activation function (e.g., ReLU)?

  • DNK
    DNK over 7 years
    So, say I want to change the activation to ReLU: I must change both the tanh in the input activation and the tanh in the output multiplication, is that correct @Wasi Ahmad?
  • Wasi Ahmad
    Wasi Ahmad over 7 years
    @DNK Yes, I would say this is necessary to maintain a sort of consistency.
  • DNK
    DNK over 7 years
    So, in other words, apart from the sigmoid used for the gates, I could choose the activation function as in a standard neural network layer @WasiAhmad
  • Wasi Ahmad
    Wasi Ahmad over 7 years
    @DNK Yes, but it depends on the application you are targeting and the type of your data. If your data better suits a specific activation unit, use it, but try other units to see how they perform in your application. Frankly speaking, all of the activation units are more or less effective in neural networks!
  • MGwynne
    MGwynne about 7 years
    The paper link above doesn't seem to work any more, but I believe it is referring to: pdfs.semanticscholar.org/a26f/… / citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.43.6996 .
  • s0urcer
    s0urcer almost 7 years
    There is a small mistake in sigmoid first derivative. It actually equals: sigmoid'(z) = exp(-z) / (1 + exp(-z))^2
  • Aaron Schumacher
    Aaron Schumacher over 6 years
    While the above answer is correct, it is not relevant to the question. Derivatives do not distinguish tanh from sigmoid, as tanh is just a rescaled and shifted sigmoid. (See: cs224d.stanford.edu/lectures/CS224d-Lecture6.pdf)
  • MiloMinderbinder
    MiloMinderbinder over 6 years
    "On the other hand, to overcome the vanishing gradient problem, we need a function whose second derivative can sustain for a long range before going to zero." - why??
  • zeal
    zeal almost 6 years
    @MiloMinderbinder Please do tell if you have the answer to the question you asked about "On the other hand, to overcome the vanishing gradient problem, we need a function whose second derivative can sustain for a long range before going to zero"?
  • End-2-End
    End-2-End over 5 years
    @Aaron, thanks for the answer. In Chris Olah's blog, for the last step of computation inside the LSTM cell, we see ht = Ot * tanh(Ct). Ot, coming from a sigmoid gate, lies in [0,1]; tanh(Ct) lies in [-1,1]. So does this mean that the output of an LSTM cell is always between -1 and +1?
  • Aaron Schumacher
    Aaron Schumacher over 5 years
    @End-2-End that sounds right to me, with the possibly unnecessary clarification that in general the whole thing is operating on vectors, so the bounds are on each component.
  • Aaron Schumacher
    Aaron Schumacher over 5 years
    @RohitTidke in the sense of en.wikipedia.org/wiki/Second_partial_derivative_test, yes
  • realmq
    realmq almost 5 years
    Agree with @MiloMinderbinder. Why? That is the key to answering this question. The rest is more about the choice of sigmoid for the gates rather than tanh.
  • End-2-End
    End-2-End almost 5 years
    @AaronSchumacher, you mentioned that values in the state should be able to both increase and decrease, and since sigmoid always has non-negative output, tanh is the preferred activation function for the output. Then wouldn't it be the same with ReLU, since those are also always non-negative? Does it mean LSTMs wouldn't work as expected if we replace tanh with ReLU?
  • Aaron Schumacher
    Aaron Schumacher almost 5 years
    @End-2-End that sounds right to me. ReLU is non-negative.
  • basav
    basav over 4 years
    The domain and range of tanh cover both the increase and decrease properties for the internal state, so intuitively that makes sense.
  • Quastiat
    Quastiat over 4 years
    @Aaron, thanks for the answer, I just have one question. You say "values in the state would only increase." But by multiplying the state with the forget gate, which is in [0,1], the state could still be decreased? I still get the idea, but that's not 100% clear to me.
  • Aaron Schumacher
    Aaron Schumacher over 4 years
    @Quastiat We can think about the additive and multiplicative parts separately: "values should be able to increase or decrease when we add the output of some function. Sigmoid output is always non-negative; values in the state would only increase." That's true of the additive part. But yes, multiplying by a number between zero and one does decrease the absolute value. (It still can't change the sign.)
  • jonathanking
    jonathanking over 4 years
    While this explains the state update rule, it fails to address the fact that the output gate of the LSTM incorporates a tanh layer, h_t = o_t * tanh(C_t). The reason for this is that it can renormalize the hidden state to lie in [-1,1] after the state update addition operation.
  • Bruce Yo
    Bruce Yo over 3 years
    Just to give my intuition: in backpropagation, it will calculate not only the "second derivative" but also the "third, fourth derivatives, etc." if there are more layers of an activation function (e.g., tanh or sigmoid). By doing so, tanh will still preserve the whole concept of the information (i.e., both the positive and negative values) while sigmoid just keeps the positive part. @realmq, please see if that makes sense. Thanks.