Related Concepts

One-hot Encoding

Generally, machine learning models take their input variables in vector form. In order to feed words, which have no intuitive vector representation, into artificial neural networks, each word has to be encoded as a unique vector.

The most commonly adopted method is called "one-hot encoding", which represents each word \(w\) in a sorted vocabulary \(V\) as a vector in \(\mathbb{R}^{|V|}\) with a single 1 at the index of \(w\) in \(V\) and 0s at all other positions, where \(|V|\) denotes the size of the vocabulary. For example:

$$ w_{abandon} = \begin{bmatrix} 1 \\ 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}, w_{abase} = \begin{bmatrix} 0 \\ 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \dotsc, w_{zesty} = \begin{bmatrix} 0 \\ 0 \\ 0 \\ \vdots \\ 1 \end{bmatrix} $$
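As a minimal sketch (not the exact implementation used here), one-hot encoding can be written in a few lines of NumPy; the four-word vocabulary below is purely illustrative:

```python
import numpy as np

# Hypothetical four-word vocabulary; a real vocabulary would be built from
# the training corpus and sorted.
vocab = sorted(["abandon", "abase", "movie", "zesty"])
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word: str) -> np.ndarray:
    """Return the |V|-dimensional one-hot vector for `word`."""
    vec = np.zeros(len(vocab))
    vec[index[word]] = 1.0
    return vec

print(one_hot("abandon"))  # [1. 0. 0. 0.]
print(one_hot("zesty"))    # [0. 0. 0. 1.]
```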

Word Encoding with Gaussian Filtering

Apart from word embedding, other encoding methods are also helpful in certain NLP tasks. For ordinal classification problems, the distance between one pair of labels should be distinguishable from the distance between another pair. For instance, a classification over the five integers from 1 to 5 may wish to learn the difference between two pairs of vectors, \((w_1, w_3)\) and \((w_1, w_5)\). With one-hot encoding, the following vectors represent the integers:

$$ w_1 = \begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}, \dotsc, w_3 = \begin{bmatrix} 0 \\ 0 \\ 1 \\ 0 \\ 0 \end{bmatrix}, \dotsc, w_5 = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 1 \end{bmatrix} $$

thus we have   \(|w_1 - w_3| = |w_1 - w_5| = \sqrt{2}\)

The above equation reveals the loss of the ordinal information. To improve the word encoding, a Gaussian filter can be applied to smooth the one-hot vectors: the filter spreads the single 1 of each one-hot vector into a Gaussian-shaped distribution centred at the original position. For example, the Gaussian-filtered versions of the above vectors would be:

$$ w'_1 = \begin{bmatrix} 0.89 \\ 0.10 \\ 0.01 \\ 0 \\ 0 \end{bmatrix}, \dotsc, w'_3 = \begin{bmatrix} 0.01 \\ 0.10 \\ 0.78 \\ 0.10 \\ 0.01 \end{bmatrix}, \dotsc, w'_5 = \begin{bmatrix} 0 \\ 0 \\ 0.01 \\ 0.10 \\ 0.89 \end{bmatrix} $$

thus,   \(|w'_1 - w'_3| \neq |w'_1 - w'_5|\), so the ordinal information is preserved.
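As an illustration only (the section does not prescribe a particular filter implementation or \(\sigma\)), the smoothing can be reproduced with SciPy's gaussian_filter1d; the value \(\sigma = 0.6\) is an assumption chosen to roughly match the numbers above:

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

NUM_LABELS = 5

def one_hot(label: int) -> np.ndarray:
    vec = np.zeros(NUM_LABELS)
    vec[label - 1] = 1.0
    return vec

def gaussian_encoded(label: int, sigma: float = 0.6) -> np.ndarray:
    # Smooth the one-hot vector with a 1-D Gaussian filter (sigma is a
    # hypothetical choice, not specified in the text).
    return gaussian_filter1d(one_hot(label), sigma=sigma)

w1, w3, w5 = one_hot(1), one_hot(3), one_hot(5)
g1, g3, g5 = gaussian_encoded(1), gaussian_encoded(3), gaussian_encoded(5)

# One-hot vectors lose the ordinal information: both distances equal sqrt(2).
print(np.linalg.norm(w1 - w3), np.linalg.norm(w1 - w5))
# After Gaussian filtering the two distances are no longer equal.
print(np.linalg.norm(g1 - g3), np.linalg.norm(g1 - g5))
```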

Bidirectional RNN

Standard RNN and LSTM networks take only the previous information in a sequence into consideration; future context is ignored. Bidirectional Recurrent Neural Networks (BRNNs) were introduced to increase the amount of input information available to the network. In a BRNN, a regular RNN unit is split into two directions: one processes the original input sequence (forward pass) and the other processes the reversed sequence (backward pass). The following figure compares the structures of a standard RNN and a BRNN. With two directions, input information from both the past and the future of the current time frame can be used, whereas a unidirectional RNN requires delays to include future information.

Structure overview of RNN and BRNN: (a) RNN, (b) BRNN.

The training procedure for a BRNN is very similar to that of an RNN, with slight differences. The structure of a BRNN requires an additional step for backpropagation because the input and output layers cannot be updated at once. Generally, in the forward pass, the forward states and backward states are computed first, followed by the output neurons. In the backward pass, the output neurons are processed first, and the forward and backward states afterwards. The weights are then updated after these two procedures.
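In modern frameworks the bidirectional structure itself needs no hand-written bookkeeping. A minimal sketch in PyTorch (the framework choice is an assumption; the text does not name one) shows that each time step exposes a concatenation of forward and backward hidden states:

```python
import torch
import torch.nn as nn

seq_len, batch_size, input_dim, hidden_dim = 20, 4, 50, 64  # illustrative sizes

# bidirectional=True splits the layer into a forward and a backward direction.
birnn = nn.LSTM(input_size=input_dim, hidden_size=hidden_dim, bidirectional=True)

x = torch.randn(seq_len, batch_size, input_dim)  # e.g. a batch of embedded words
output, (h_n, c_n) = birnn(x)

# Forward and backward hidden states are concatenated at every time step,
# so the per-step feature size is 2 * hidden_dim.
print(output.shape)  # torch.Size([20, 4, 128])
print(h_n.shape)     # torch.Size([2, 4, 64]) -- one final state per direction
```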

Convolutional Neural Networks

The Convolutional Neural Network (CNN) is a widely adopted artificial neural network architecture in deep learning. CNNs use "kernels" of different sizes, in other words sliding windows, to scan inputs and extract high-level patterns. Although the most successful application of CNNs is image processing, they also show potential in natural language processing. Using one-dimensional kernels, CNNs can extract contextual information much like LSTM networks. The following figure shows a sample design of a CNN architecture for sentiment analysis. This architecture has achieved noticeable performance across various classification datasets, mostly sentiment analysis tasks, and a new state of the art on a few of them.

Example: CNN model architecture with two channels for an example sentence.

Instead of image pixels, the input to a CNN model for NLP is a sentence represented as a matrix, each row of which corresponds to one embedded word. As shown in the figure above, the input is an \(n \times k\) matrix, where \(n\) is the length of the longest sentence in the training data (shorter sentences are padded) and \(k\) is the dimension of the word vectors. Multiple filters and feature maps capture the features of the sentence, which are expected to be the keywords reflecting the sentiment the sentence expresses. Finally, a fully connected layer with dropout and a softmax output predicts the final result.
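A compact sketch of such a model in PyTorch is given below. It is a single-channel simplification of the two-channel architecture in the figure, and the vocabulary size, kernel sizes, and filter counts are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    """Single-channel CNN for sentence classification (illustrative sizes)."""
    def __init__(self, vocab_size=10000, embed_dim=300,
                 kernel_sizes=(3, 4, 5), num_filters=100, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # One 1-D convolution per kernel size, sliding over word positions.
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, num_filters, k) for k in kernel_sizes])
        self.dropout = nn.Dropout(0.5)
        self.fc = nn.Linear(num_filters * len(kernel_sizes), num_classes)

    def forward(self, token_ids):          # token_ids: (batch, seq_len)
        x = self.embedding(token_ids)      # (batch, seq_len, embed_dim)
        x = x.transpose(1, 2)              # (batch, embed_dim, seq_len)
        # Max-over-time pooling of each feature map, then concatenation.
        pooled = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        features = self.dropout(torch.cat(pooled, dim=1))
        return self.fc(features)           # logits; softmax gives class probabilities

model = TextCNN()
logits = model(torch.randint(0, 10000, (8, 40)))  # batch of 8 padded sentences
print(logits.shape)                                # torch.Size([8, 2])
```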

Word Segmentation

The differences between English and Chinese introduce additional challenges for Chinese language processing. English is a word-based language: in an English sentence, the words are naturally separated by whitespace. The Chinese language, however, is character-based. In most cases, a single Chinese character cannot by itself form a complete semantic unit equivalent to an English word. Word segmentation is a process that takes a sequence of Chinese characters and returns the same sequence with semantic words grouped and separated. For a simple example, given the Chinese sentence "这是一部好电影 (This is a good movie)", a correct segmentation looks like:

这  是  一部  好  电影

In most Chinese NLP tasks, word segmentation is a necessary step before encoding the words into vectors.
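As a minimal example (the text does not name a specific segmentation tool, so the popular jieba library is used here purely as an illustration):

```python
import jieba  # assumed third-party segmenter: pip install jieba

sentence = "这是一部好电影"   # "This is a good movie"
words = jieba.lcut(sentence)  # returns a list of segmented words
print("  ".join(words))       # e.g. 这  是  一部  好  电影
```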