Long Short-Term Memory (LSTM) is a special Recurrent Neural Network (RNN) architecture inspired by biological memory. RNNs have been successfully adopted in many existing studies of NLP tasks because of their flexibility in handling input sequences of varying length and their ability to model relations between elements within a sequence.
However, the standard RNN unit contains only a single \(\tanh\) layer and suffers from the vanishing-gradient problem, so it cannot capture long-term dependencies and is difficult to train efficiently.
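Concretely, using the same notation as the LSTM equations below, the standard RNN unit produces its output with this single layer:
$$
h_t = \tanh(W\cdot[h_{t-1}, x_t] + b)
$$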
To solve this problem, LSTM introduces a mechanism of "long-term memory" in addition to the "short-term memory" effect of the standard RNN. The following figure shows the internal structure of an LSTM unit.
The inputs of an LSTM unit at time step \(t\) are the current input \(x_t\), the previous output \(h_{t-1}\), and the previous memory \(C_{t-1}\). First, the unit decides what to "forget" from the previous memory via a "forget gate" defined as
$$
f_t = \sigma(W_f\cdot[h_{t-1}, x_t] + b_f)
$$
This gate takes the previous output and the current input, and produces values between 0 and 1 via a sigmoid function; a value close to 0 means the corresponding part of the memory is discarded, while a value close to 1 means it is kept. Next, the LSTM unit decides what "new knowledge" to memorize with an "input gate":
$$
i_t = \sigma(W_i\cdot[h_{t-1}, x_t] + b_i)
$$
The candidate new information to be memorized is generated by a \(\tanh\) layer:
$$
\tilde{C}_t = \tanh(W_C\cdot[h_{t-1}, x_t] + b_C)
$$
With the above definitions, the LSTM unit updates its long-term memory by forgetting part of the old memory and adding the gated new candidate:
$$
C_t = f_t\ast C_{t-1} + i_t\ast\tilde{C}_t
$$
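Intuitively, when \(f_t\) is close to 1 and \(i_t\) is close to 0, the old memory is carried forward almost unchanged (\(C_t \approx C_{t-1}\)); when \(f_t\) is close to 0 and \(i_t\) is close to 1, the old memory is largely replaced by the new candidate \(\tilde{C}_t\).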
Finally, the output of the unit is generated by an "output gate":
$$
o_t = \sigma(W_o\cdot[h_{t-1}, x_t] + b_o)
$$
$$
h_t = o_t\ast\tanh(C_t)
$$
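To make the above concrete, here is a minimal NumPy sketch of one LSTM forward step that follows the equations in this section. The function name, weight shapes, and the example dimensions are choices made for this illustration, not any particular library's API.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W_f, b_f, W_i, b_i, W_C, b_C, W_o, b_o):
    """One LSTM forward step: returns the new output h_t and memory C_t."""
    z = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)             # forget gate
    i_t = sigmoid(W_i @ z + b_i)             # input gate
    C_tilde = np.tanh(W_C @ z + b_C)         # candidate memory
    C_t = f_t * C_prev + i_t * C_tilde       # update long-term memory
    o_t = sigmoid(W_o @ z + b_o)             # output gate
    h_t = o_t * np.tanh(C_t)                 # short-term output
    return h_t, C_t

# Example usage with random weights: input size 4, hidden size 3.
rng = np.random.default_rng(0)
n_in, n_hid = 4, 3
params = [rng.standard_normal((n_hid, n_hid + n_in)) if k % 2 == 0
          else np.zeros(n_hid) for k in range(8)]   # W_f, b_f, ..., W_o, b_o
h, C = np.zeros(n_hid), np.zeros(n_hid)
h, C = lstm_step(rng.standard_normal(n_in), h, C, *params)
```

Applying this step repeatedly over the elements of a sequence, carrying \(h_t\) and \(C_t\) forward each time, gives the full recurrent computation.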
LSTM is very useful in NLP tasks because, in human languages, high-level semantics are often conveyed by sequences of consecutive words, and capturing them requires remembering information across the sequence.