Representation learning of sequential data with application in bioinformatics

Maria Brbić
Division of Electronics
Ruđer Bošković Institute, Zagreb
E-mail: [email protected]
Abstract—This paper presents a brief overview of approaches for handling sequential data and of representation learning based on neural networks. Recurrent neural networks are described as an approach for sequence learning. We discuss the problems that arise when training them and a possible architectural redesign, long short-term memory networks. A variety of approaches to representation learning have been proposed in the literature. This paper presents a summary of deep learning ideas and techniques, with an emphasis on autoencoders. The final part of the paper summarizes state-of-the-art applications of deep learning in bioinformatics.
Index Terms—Neural networks, recurrent neural networks, long short-term memory networks, deep learning, representation learning, autoencoders
1. INTRODUCTION
The choice of input representation has a strong influence on the performance of machine learning algorithms. Great effort usually needs to be invested to design the right set of features that enable efficient learning. This requires prior human knowledge about the specific task being solved, which in some cases is scarce, and the feature design process is time-consuming. Instead of manually extracting features, it would be desirable for the learning algorithm to discover a good representation of the raw input without human engagement and to adjust easily to different tasks. This is the idea of representation learning. Representation learning provides an opportunity to exploit the large amounts of unlabeled data available in many domains and use them to learn a representation that captures the structure of the input distribution.
Deep learning algorithms provide a solution by constructing features using many layers of non-linear transformations. The goal is to form hierarchical representations, in the sense that features at higher levels are composed of features from lower levels. The hierarchy of features allows the algorithm to form different levels of abstraction. This is motivated by the way people perceive the world: describing abstract concepts in terms of less abstract ones. Although the benefits of deep architectures have been known for decades, the challenge was how to train deep networks. Efficient parallelization on GPUs and advances in training techniques revolutionised the field, allowing deep networks to outperform state-of-the-art methods on many important applications, such as object recognition [28], speech recognition [26] and natural language processing [11].
The paper is organized as follows. Section 2 briefly reviews the problem of sequence modeling and describes a possible solution provided by recurrent neural networks. Section 3 discusses representation learning, with an emphasis on deep learning and autoencoders. Section 4 provides an overview of deep learning applications in bioinformatics. Section 5 concludes the paper.
2. SEQUENCE MODELING
Standard machine learning algorithms are based on the assumption that data points are independent. However, when dealing with sequential data, where data points are strongly correlated, this assumption leads to serious limitations. Another issue of standard approaches is that they accept only a fixed number of inputs, whereas sequential data can have inputs and outputs of varying length. Sequence modeling tasks arise in various domains such as time series prediction, speech processing, handwriting recognition and natural language processing, as well as in bioinformatics when dealing with protein or DNA sequences, notably protein structure and function prediction. Thus, extending existing models to capture dependencies across data points and to handle data of varying length is of crucial importance.
In sequence labeling, the task is to assign an arbitrary-length input sequence (x^(1), x^(2), ..., x^(T−1), x^(T)) to the sequence of labels (y^(1), y^(2), ..., y^(T−1), y^(T)). The training set is defined as a set of pairs of sequences (x^(t), y^(t)), where the index t denotes the current position in the sequence.
Graves [21] defines three classes of sequence labeling tasks:
• Sequence classification. Each input sequence is assigned to a single class. If the input sequences have fixed length or can easily be padded, there is usually no need for sequential algorithms and any standard classification algorithm can be applied. However, sequential algorithms may better adapt to distortions in the input data.
• Segment classification. The target sequence consists of multiple labels, but the locations of the input segments to which the labels are assigned are known. This is usually the case in natural language processing and bioinformatics. The main difference from sequence classification is the use of context from either side of the segments, which represents a problem for standard algorithms designed to process one input at a time.
• Temporal classification. This is the most general class of sequence labeling, where nothing can be assumed about the label sequences and the algorithm has to decide where in the input sequence labels should be assigned. Approaches for handling such data include HMM-neural network hybrids [25] and connectionist temporal classification [23]. Temporal classification approaches are beyond the scope of this review.
As can be seen from these definitions, sequence classification is a special case of segment classification, while segment classification is a special case of temporal classification. In this paper we refer to sequence labeling as the segment classification task.
A possible approach to sequence modeling is to use a sliding window that includes predecessors and/or successors of the current input and then to apply any standard classification algorithm. A serious drawback of this approach is that we usually do not know the optimal sliding window length, and this length can be input-specific. This approach also suffers from the fact that it can exploit only the local dependencies included in the current window; long-range interactions cannot be captured.
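To make the sliding-window idea concrete, the following minimal NumPy sketch builds one fixed-length feature vector per sequence position by concatenating the features of the current position with those of its neighbors. The window length of 5, the zero padding and the toy dimensions are illustrative assumptions, not choices prescribed by any of the cited work.

```python
import numpy as np

def sliding_windows(x, window=5, pad_value=0.0):
    """Build one fixed-length feature vector per sequence position.

    x is a (T, d) array of per-position features. Each output row
    concatenates the features of the current position and of its
    (window - 1) / 2 predecessors and successors, padding at the
    sequence boundaries with pad_value.
    """
    half = window // 2
    T, d = x.shape
    padded = np.full((T + 2 * half, d), pad_value)
    padded[half:half + T] = x
    return np.stack([padded[t:t + window].ravel() for t in range(T)])

# Toy usage: 8 positions with 3 features each -> 8 windows of length 5 * 3
features = sliding_windows(np.random.randn(8, 3), window=5)
print(features.shape)  # (8, 15)
```

Any standard classifier can then be trained on these rows, but, as noted above, it will only see the dependencies contained in each window.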
A. Recurrent Neural Networks
Recurrent neural networks (RNNs) are an extension of neural networks designed for handling sequential data. Feedforward networks map new inputs to outputs while forgetting the previous states of the network. RNNs, on the other hand, allow previous states to affect the current network state by mapping the history of previous inputs to outputs. This is achieved through recurrent connections from a node to itself. Also, unlike the traditional fully connected feedforward network that learns separate parameters for each input feature, RNNs share the same parameters across multiple time steps. It is this parameter sharing that enables RNNs to generalize across different positions in a sequence and across different sequence lengths.
Different architectures of RNNs have been proposed in the literature, but in this section we refer to the representative example of an RNN containing a single, self-connected hidden layer.
By unfolding the computational graph, as shown in Figure 1, the network can be seen as an acyclic, deep neural network with one layer per time step and weights shared across time steps [38]. Thus, the forward and backward passes can be applied as in a standard feedforward neural network.
The forward pass applied to the unfolded graph is the same as in a multilayer perceptron (MLP), except that we need to include the connections from the hidden layer at the previous time step. The output of the hidden layer in an RNN is calculated as:

a^(t) = θ(U x^(t) + W h^(t−1) + b),    (1)

where U is the input-to-hidden weight matrix, W is the hidden-to-hidden weight matrix, b is a bias vector and θ is the activation function.
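As an illustration, the following minimal NumPy sketch unrolls this forward pass over a toy sequence. Only the hidden-layer update follows Eq. (1); the tanh nonlinearity, the softmax output layer with parameters V and c, and all dimensions are assumptions made for the example.

```python
import numpy as np

def rnn_forward(xs, U, W, V, b, c):
    """Forward pass of a single-hidden-layer RNN unfolded in time.

    xs is a list of input vectors x^(t). U, W and b parameterize the
    hidden layer as in Eq. (1); V and c map the hidden state to a
    softmax output (an assumed output layer, not specified above).
    """
    h = np.zeros(W.shape[0])
    outputs = []
    for x in xs:
        h = np.tanh(U @ x + W @ h + b)               # Eq. (1) with theta = tanh
        z = V @ h + c
        outputs.append(np.exp(z) / np.exp(z).sum())  # softmax over output units
    return outputs

# Toy dimensions: 4-dimensional inputs, 8 hidden units, 3 output classes
rng = np.random.default_rng(0)
U, W = rng.normal(size=(8, 4)), rng.normal(size=(8, 8))
V, b, c = rng.normal(size=(3, 8)), np.zeros(8), np.zeros(3)
ys = rnn_forward([rng.normal(size=4) for _ in range(6)], U, W, V, b, c)
```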
The back-propagation algorithm [46] applied to RNNs is called back-propagation through time (BPTT) and was introduced in [53]. Like back-propagation, BPTT iteratively computes the gradient by applying the chain rule to the unfolded graph. The gradient of the loss function is calculated with respect to the activation of the hidden layer through its influence on the hidden layer at the next time step and, as in standard back-propagation, through its influence on the output units. Since the weights need to be the same at each time step, BPTT averages the weight updates over the whole sequence.
One potential drawback of standard RNNs is that they access only past context, while completely ignoring future context. However, in many applications we want the output to exploit information from both directions. This is more often the case in spatial than in temporal domains. For example, in protein secondary structure prediction there is no need to differentiate between the past and the future context.
Bidirectional recurrent neural networks (BRNNs), introduced in [47], extend the standard RNN architecture and provide a solution to this limitation. Bidirectional RNNs combine an RNN that moves forward in time with an RNN that moves backward in time, thus using information from both the future and the past. This is achieved by presenting each sequence in two directions to two separate recurrent hidden layers connected to the same output layer, as shown in Figure 2. In the forward pass, the sequence is processed in both the forward and the backward direction by the two hidden layers, and the output is calculated after both hidden layers have processed the entire input sequence. In the backward pass, all output layer errors are first calculated and then back-propagated to the two hidden layers in opposite directions [21].
Figure 1: Unfolded recurrent neural network. Image from [20].

Figure 2: Unfolded bidirectional recurrent neural network. Image from [21].
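A minimal NumPy sketch of the bidirectional forward pass described above is given below: one hidden layer reads the sequence from left to right, a second reads it from right to left, and the output layer (omitted here) would see both hidden states at every position. The tanh activation and all shapes are assumptions for the example.

```python
import numpy as np

def brnn_hidden_states(xs, Uf, Wf, bf, Ub, Wb, bb):
    """Hidden states of a bidirectional RNN layer (forward pass only).

    Returns a (T, 2n) array whose row t is the concatenation of the
    forward hidden state h_fwd^(t) and the backward hidden state
    h_bwd^(t); an output layer would be applied to these rows.
    """
    T, n = len(xs), Wf.shape[0]
    h_fwd, h_bwd = np.zeros((T, n)), np.zeros((T, n))
    hf = hb = np.zeros(n)
    for t in range(T):                       # left-to-right pass
        hf = np.tanh(Uf @ xs[t] + Wf @ hf + bf)
        h_fwd[t] = hf
    for t in reversed(range(T)):             # right-to-left pass
        hb = np.tanh(Ub @ xs[t] + Wb @ hb + bb)
        h_bwd[t] = hb
    return np.concatenate([h_fwd, h_bwd], axis=1)

# Toy usage: 5 time steps, 4-dimensional inputs, 8 units per direction
rng = np.random.default_rng(1)
Uf, Wf, bf, Ub, Wb, bb = [rng.normal(size=s)
                          for s in [(8, 4), (8, 8), (8,)] * 2]
H = brnn_hidden_states([rng.normal(size=4) for _ in range(5)],
                       Uf, Wf, bf, Ub, Wb, bb)
print(H.shape)  # (5, 16)
```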
The main issue with RNNs is that they can be very difficult to train. When back-propagating errors across many time steps, the gradient tends to become very small (usually) or very large due to the repeated multiplication of error terms. This is known in the literature as the vanishing gradient problem [32], [7]. Even if we assume that the parameters are stable, RNNs have difficulty learning long-term dependencies because exponentially smaller weights are given to long-term interactions [20]. Although in theory RNNs are able to learn long-range dependencies, in practice the range of context that can be accessed is limited. As the range of context increases, gradient descent becomes increasingly inefficient [7].
Instead of modifying the learning algorithm, it is possible to modify the RNN architecture so that it can store information over longer periods of time. This is the idea of long short-term memory networks.
B. Long Short-Term Memory Networks
Long short-term memory (LSTM) networks, introduced in [34], represent a redesign of RNNs that overcomes the vanishing gradient problem by enforcing constant error flow. In an LSTM network, the units in the hidden layers are replaced by memory blocks. Each memory block contains one or more memory cells and a pair of multiplicative gate units: an input and an output gate. The memory cell is the central unit of the memory block and stores a value over time. It is built around a self-connected linear unit with a fixed weight of 1. This fixed self-connection is called the constant error carrousel (CEC) and ensures that, in the absence of an outside signal, the cell state can remain constant through time. The multiplicative gates enable memory cells to store and access information over long time periods. The input gate allows or blocks the forward-flowing activation from entering the memory cell, while the output gate allows or blocks the state of the memory cell from influencing other neurons.
An extension of this architecture was proposed in [18]. The authors discovered that the internal values of the cells could grow without bounds and proposed extending LSTM networks by replacing the CEC connections with multiplicative forget gates. By adding forget gates, the network can learn when to forget and when to remember its previous cell state. This extension is now a standard part of the LSTM architecture. Moreover, in [17] the authors proposed a further extension, named peephole connections, which adds direct connections from the internal cell state to the input, output and forget gates.
An LSTM memory block with the described extensions is shown in Figure 3. The gate activation function is usually the logistic sigmoid (denoted with f in the figure), so that the gate activations are between 0 and 1. The cell input and output activation functions (denoted with g and h) are usually the hyperbolic tangent or the sigmoid, although sometimes h may be the identity function [21].
The activation of an LSTM hidden layer at time t is calculated with the following equations (without peephole connections):

g^(t) = tanh(W_xg x^(t) + W_hg h^(t−1) + b_g)
i^(t) = σ(W_xi x^(t) + W_hi h^(t−1) + b_i)
f^(t) = σ(W_xf x^(t) + W_hf h^(t−1) + b_f)
o^(t) = σ(W_xo x^(t) + W_ho h^(t−1) + b_o)    (2)
c^(t) = i^(t) ◦ g^(t) + f^(t) ◦ c^(t−1)
h^(t) = o^(t) ◦ tanh(c^(t)),

where i, f and o denote the input, forget and output gates, respectively, g denotes the activation of the new input value, c denotes the cell state and h denotes the output of the hidden state. The symbol ◦ denotes the Hadamard product.

Figure 3: Memory block with one memory cell. Peephole connections are shown with dashed lines. All other connections have fixed weight of 1. Small black circles denote multiplication. Image from [21].
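The following NumPy sketch implements one time step of Eq. (2) without peephole connections; the parameter dictionary, its key names and the toy dimensions are conventions chosen for this example only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, P):
    """One LSTM time step following Eq. (2), without peepholes.

    P holds the weight matrices W_x*, W_h* and biases b_* for the
    candidate input g and the input, forget and output gates i, f, o.
    """
    g = np.tanh(P["Wxg"] @ x + P["Whg"] @ h_prev + P["bg"])   # new input value
    i = sigmoid(P["Wxi"] @ x + P["Whi"] @ h_prev + P["bi"])   # input gate
    f = sigmoid(P["Wxf"] @ x + P["Whf"] @ h_prev + P["bf"])   # forget gate
    o = sigmoid(P["Wxo"] @ x + P["Who"] @ h_prev + P["bo"])   # output gate
    c = i * g + f * c_prev          # cell state update
    h = o * np.tanh(c)              # block output
    return h, c

# Toy usage: 4-dimensional inputs, 6 memory cells
rng = np.random.default_rng(2)
P = {f"Wx{k}": rng.normal(size=(6, 4)) for k in "gifo"}
P.update({f"Wh{k}": rng.normal(size=(6, 6)) for k in "gifo"})
P.update({f"b{k}": np.zeros(6) for k in "gifo"})
h, c = lstm_step(rng.normal(size=4), np.zeros(6), np.zeros(6), P)
```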
LSTM networks can be trained using back-propagation through time. The LSTM learns what to store in memory and when to access it. When the input gate is closed (takes a value around zero), a new activation cannot enter the cell and change the cell state. When the output gate is closed, the activation cannot leave the cell and affect the rest of the network. This enables the network to remember values for a long time and thus capture long-term dependencies.
Bidirectional LSTM networks (BLSTM) [27] are able to model long-range structure in two directions. BLSTM networks combine the LSTM architecture with bidirectional RNNs.
LSTM networks have solved complex artificial tasks that had not been solved by other RNN architectures [34]. They have also been successfully applied in many domains requiring long-range memory, including speech recognition [23], handwriting recognition [24] and bioinformatics [33].
In [36] the authors measured the importance of the LSTM gates and reported that the output gate is not important, the input gate is important, and the forget gate is extremely important on all problems except language modelling. They also emphasized that it is very important to properly initialize the forget gate, because otherwise the LSTM may not be capable of solving problems that include long-range interactions.
C. Gated recurrent unit
The gated recurrent unit (GRU) was recently proposed in [9]. It is motivated by the LSTM unit, but is simpler to compute. The GRU has two gates: an update gate and a reset gate. It is defined by the following equations:
r^(t) = σ(W_xr x^(t) + W_hr h^(t−1) + b_r)
z^(t) = σ(W_xz x^(t) + W_hz h^(t−1) + b_z)
h̃^(t) = tanh(W_xh x^(t) + W_hh (r^(t) ◦ h^(t−1)) + b_h)    (3)
h^(t) = z^(t) ◦ h^(t−1) + (1 − z^(t)) ◦ h̃^(t),

where r denotes the reset gate and z denotes the update gate.
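For comparison with the LSTM step, a minimal NumPy sketch of one GRU time step following Eq. (3) is shown below; the shapes and toy values are again assumptions for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, Wxr, Whr, br, Wxz, Whz, bz, Wxh, Whh, bh):
    """One GRU time step following Eq. (3)."""
    r = sigmoid(Wxr @ x + Whr @ h_prev + br)          # reset gate
    z = sigmoid(Wxz @ x + Whz @ h_prev + bz)          # update gate
    h_tilde = np.tanh(Wxh @ x + Whh @ (r * h_prev) + bh)
    return z * h_prev + (1.0 - z) * h_tilde           # new hidden state

# Toy usage: 4-dimensional input, 6 hidden units
rng = np.random.default_rng(3)
params = [rng.normal(size=s) for s in [(6, 4), (6, 6), (6,)] * 3]
h = gru_step(rng.normal(size=4), np.zeros(6), *params)
```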
The GRU and LSTM architectures were compared in [10] and [36]. In [10] the authors compared the architectures on polyphonic music modeling and speech modeling and reported that the GRU outperformed the LSTM on the majority of datasets. In [36] the architectures were compared on three different tasks and the GRU outperformed the LSTM on all tasks except language modelling. However, once the bias of the LSTM forget gate was initialized to a large positive value, the LSTM no longer lagged behind the GRU.
3. REPRESENTATION LEARNING
In many domains we have access to a large amount of data, but only a small part of it is labeled. Representation learning is motivated by the idea of learning a good representation from the unlabeled data and then using it for supervised learning tasks. A good representation is one that captures the unknown factors of variation in the training set distribution, i.e., uncovers the latent factors that explain the observed variation in the data, as discussed in [5] and [20]. When learning a representation from unlabeled data, we hope that the features useful for the unsupervised task will also be useful for supervised learning tasks. This is true when there is a relationship between the input distribution P(X) and the target conditional distribution P(Y|X). Although in general this may not be the case, in many real-world applications some of the factors that shape the input X are predictive of the output Y [4]. In cases where we have enough labeled data, representation learning can be performed in a supervised manner. However, since the supervised signal tries to filter out information irrelevant for the task at hand, the learned features are then task-specific and would usually not be successful when applied to a different task.
Feedforward neural networks can be interpreted as performing representation learning: every hidden layer provides a representation that makes the classification in the last layer of the network easier. The representation associated with an input is the pattern of activation of the hidden units. Although the universal approximation theorem [12] states that a standard multilayer feed-forward network with a single hidden layer containing a finite number of hidden neurons can approximate any function, in practice shallow network architectures are representationally limited. As argued in [3], when a function can be represented by a deep architecture, it might need a very large architecture to be represented by one that is not sufficiently deep.
Deep neural networks tend to learn a hierarchy of features, with higher-level features composed of lower-level ones. The aim is to discover, in the higher levels of the representation, more abstract features that are invariant to local changes. A nice example is the application to object recognition: low-level features are edge detectors that can be combined to construct different local shapes, while combinations of local shapes are used to construct objects in the image.
Another important aspect of deep architectures is feature reuse. Features in the lower levels of the hierarchy are available and can be useful to a large group of higher-level features. In deep architectures the number of ways to reuse different features can grow exponentially with depth, and new configurations of these features can be used to describe new concepts and generalize to unseen data. This is the idea of distributed representations [3].
Unsupervised deep learning networks, just like their supervised counterparts, learn a representation as a byproduct of trying to optimize the objective function. The only difference is that in the supervised case the learned representation should indicate the factors of variation important for the supervised task, while in the unsupervised case it should explain the observed variations in the input data.
Deep neural networks require the optimization of highly non-convex functions, and searching the parameter space of such functions is a difficult optimization problem. Greedy layer-wise unsupervised pretraining, introduced in [30], has long been used to initialize deep neural networks. It is based on learning single-layer modules one layer at a time in an unsupervised manner. Each layer takes the output of the previous layer and produces a new representation. These layers are then stacked together and used to initialize a deep fully connected network. The basic single-layer learning modules for multilayer architectures are autoencoders and restricted Boltzmann machines (RBMs). By using autoencoders the network can be interpreted as a computation graph, while by using RBMs it can be interpreted as a probabilistic graphical model [5]. In this review the emphasis is on the first approach.
With the invention of new initialization techniques [19] and training approaches, such as the rectified linear unit (ReLU) [40], dropout [48], and batch normalization [35], unsupervised pretraining is no longer required as a technique for training a fully connected deep network.
One very successful example of representation learning is word embeddings in the Natural Language Processing (NLP) domain [6]. In the original space, every word is represented using a one-hot encoding and, therefore, every pair of words shares the same distance from each other. In the embedded space, words are represented by vectors that encode similarity between words, so that semantically similar words are close to each other. In [11] these word embeddings were learned on a large corpus of unlabeled texts from Wikipedia. Positive examples were original windows from Wikipedia, while negative examples were the same windows but with the middle word replaced by a random word. The learned representation is now used to solve different NLP tasks.
A. Autoencoders
An autoencoder is a neural network that is trained to reconstruct its input in the output layer through an internal representation. Usually, we are not interested in the output of the decoder, but we hope that the internal representation will capture the structure of the input data.
The network consists of two parts: the encoder function f_θ: X → F and the decoder function g_θ: F → X, where X denotes the input space and F denotes the feature space. The encoder function f_θ computes the representation vector, or code, h^(i) for each input vector x^(i):

h^(i) = f_θ(x^(i)).    (4)

The decoder function g_θ produces a reconstruction x̃^(i) of each input vector x^(i) given its representation vector h^(i):

x̃^(i) = g_θ(h^(i)).    (5)

The set of encoder and decoder parameters θ is optimized to minimize the reconstruction error:

Σ_i L(x^(i), g_θ(f_θ(x^(i)))),    (6)
where L denotes a loss function. When f and g are linear functions and L is the mean squared error, the autoencoder learns the same subspace as PCA [5].
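A minimal NumPy sketch of Eqs. (4)-(6) is given below: a sigmoid encoder, an affine decoder with tied weights, and the summed squared reconstruction error over a toy dataset. The tied weights, the sigmoid nonlinearity and the undercomplete code size are assumptions for the example; training would then minimize this objective by gradient descent.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def encode(x, W, b):
    """Encoder f_theta of Eq. (4): code h = sigmoid(W x + b)."""
    return sigmoid(W @ x + b)

def decode(h, W, b_out):
    """Decoder g_theta of Eq. (5), using the tied weights W^T."""
    return W.T @ h + b_out

def reconstruction_error(X, W, b, b_out):
    """Objective of Eq. (6) with a squared-error loss L."""
    return sum(np.sum((x - decode(encode(x, W, b), W, b_out)) ** 2) for x in X)

# Toy setup: 10-dimensional inputs compressed to a 3-dimensional code
rng = np.random.default_rng(4)
X = [rng.normal(size=10) for _ in range(100)]
W, b, b_out = 0.1 * rng.normal(size=(3, 10)), np.zeros(3), np.zeros(10)
print(reconstruction_error(X, W, b, b_out))
```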
Encoder and decoder weights are usually tied. This restricts
the model’s capacity and reduces the number of parameters to
optimize. However, if we do not impose any other constraints
on the network structure or the training criterion, the network
could learn the identity function and just copy the input into
the representation with zero reconstruction error everywhere.
One way to avoid this is to constrain the representation vector
to have a lower dimension than the input vector. This forces the
autoencoder to learn a compressed representation of the input.
Instead of limiting the capacity of the model, it is possible
to change the objective function to include other properties
such as representation sparsity, smallness of the derivative and
robustness to noise [20].
Sparse autoencoders, introduced in [44], allow a large number of hidden units but constrain them to be inactive most of the time. This forces the autoencoder to learn sparse features. Representation sparsity is achieved by adding a sparsity penalty as a regularization term, so that the optimization objective becomes:

Σ_i L(x^(i), g_θ(f_θ(x^(i)))) + Ω(h),    (7)

where Ω(h) denotes the sparsity penalty term.
Usually, the hidden unit activations are penalized using the L1 penalty:

Ω(h) = λ Σ_j |h_j|,    (8)
or the Kullback-Leibler divergence with respect to the Bernoulli distribution [41]:

Ω(h) = λ Σ_j [ ρ log(ρ / h_j) + (1 − ρ) log((1 − ρ) / (1 − h_j)) ],    (9)
where λ is a hyperparameter, ρ denotes the sparsity parameter (typically a small value close to zero) and h_j is the average activation of hidden unit j across examples.
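A minimal NumPy sketch of the penalty in Eq. (9) follows; the target sparsity ρ, the weight λ and the small epsilon used to avoid log(0) are illustrative implementation choices.

```python
import numpy as np

def kl_sparsity_penalty(H, rho=0.05, lam=1e-3):
    """Sparsity penalty of Eq. (9).

    H is an (N, n_hidden) array of sigmoid hidden activations over N
    examples; the penalty is computed from the average activation of
    each hidden unit across the examples.
    """
    eps = 1e-12                                  # numerical safeguard only
    h_bar = H.mean(axis=0)                       # average activation per unit
    kl = (rho * np.log(rho / (h_bar + eps))
          + (1 - rho) * np.log((1 - rho) / (1 - h_bar + eps)))
    return lam * kl.sum()

# Toy usage: 200 examples, 50 hidden units with activations in (0, 1)
H = np.random.default_rng(5).uniform(size=(200, 50))
print(kl_sparsity_penalty(H))
```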
A denoising autoencoder [51] is an autoencoder that receives a corrupted version of an input and is trained to reconstruct the original input from the corrupted one. In this way the autoencoder needs to learn not only to reconstruct the input, but also to denoise it. In order to map low-probability corrupted inputs to the clean inputs that have higher probability, the autoencoder is encouraged to learn the structure of the data. The denoising autoencoder works by first corrupting the original input with some corruption process; the corrupted input is then mapped to an internal representation from which the autoencoder needs to produce the original input.
As with the basic autoencoder, the optimization objective is
to minimize the reconstruction error:
Σ_i L(x^(i), g_θ(f_θ(x̃^(i)))),    (10)

where x̃ denotes the corrupted version of the original input x.
In [51] the corruption process was masking noise: parts of the input were erased, while other parts were left unchanged. Specifically, for each input x, randomly chosen components were set to zero and the autoencoder was trained to fill in the missing input data. Other corruption processes can also be used, such as additive Gaussian noise and salt-and-pepper noise, considered in [52].
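A sketch of the masking-noise corruption described above is shown below; the corruption level of 0.3 is only an illustrative value.

```python
import numpy as np

def masking_noise(x, corruption_level=0.3, rng=None):
    """Set a random subset of input components to zero, as in [51]."""
    rng = rng if rng is not None else np.random.default_rng()
    mask = rng.random(x.shape) >= corruption_level   # True = keep component
    return x * mask

x = np.ones(10)
x_tilde = masking_noise(x, corruption_level=0.3)
# The denoising autoencoder is then trained to reconstruct x from
# x_tilde by minimizing the loss in Eq. (10).
```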
Contractive autoencoders [45] encourage robustness of the representation to small input changes by penalizing the sensitivity of the features to small input perturbations. The penalty term corresponds to the Frobenius norm of the Jacobian matrix of the encoder activations with respect to the input. Low-valued first derivatives make the representation robust to small changes of the input. Minimizing only the Jacobian term would lead to learning a constant representation, but in combination with the minimization of the reconstruction error, contractive autoencoders are forced to learn a representation from which the input can be well reconstructed.
The objective function of the contractive autoencoder is
given by:
Σ_i ( L(x^(i), g_θ(f_θ(x^(i)))) + λ ||J_fθ(x^(i))||²_F ),    (11)

where λ is a hyperparameter, J_f(x) is the Jacobian matrix of the function f with respect to x and ||·||_F denotes the Frobenius norm.
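For a sigmoid encoder h = σ(W x + b), the Jacobian is diag(h ◦ (1 − h)) W, so the penalty in Eq. (11) has a simple closed form. The sketch below computes it in NumPy; the choice of a sigmoid encoder and the value of λ are assumptions for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def contractive_penalty(x, W, b, lam=0.1):
    """Squared Frobenius norm of the encoder Jacobian, times lambda.

    For h = sigmoid(W x + b) the Jacobian is diag(h * (1 - h)) W, so
    its squared Frobenius norm is sum_j (h_j (1 - h_j))^2 * ||W_j||^2.
    """
    h = sigmoid(W @ x + b)
    return lam * np.sum((h * (1 - h)) ** 2 * np.sum(W ** 2, axis=1))

# Toy usage: 10-dimensional input, 5 hidden units
rng = np.random.default_rng(6)
W, b = rng.normal(size=(5, 10)), np.zeros(5)
print(contractive_penalty(rng.normal(size=10), W, b))
```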
There is a relationship between the contractive and the denoising autoencoders. In particular, the denoising autoencoder with very small Gaussian corruption and a squared error loss function can be seen as a particular kind of contractive autoencoder, as shown in [1].
In the case of training a deep autoencoder, the usual strategy is layer-by-layer training, after which the shallow autoencoders are stacked together [20]. This was successfully done in [31] to reduce the dimensionality of the data.
B. Representation learning for sequential data
RNNs can be seen as deep networks when unfolded in time. However, the depth of RNNs is related to the need for memory and not to the construction of hierarchical features as in deep feedforward neural networks. At each time step of an RNN, the current input passes through only one hidden layer before it is processed into the output. From this point of view, an RNN can be interpreted as having a shallow architecture. However, just like feedforward neural networks, RNNs can also benefit from depth, and deep RNNs have been shown to outperform shallow RNNs on different tasks.
In [29] a deep RNN was constructed by stacking shallow RNNs. Each layer of the network is an RNN and therefore receives as input its own hidden state from the previous time step, but it also receives the hidden state of the previous layer from the same time step. This architecture naturally combines the recurrent and the deep neural network architectures. The authors trained a 5-layer network with stochastic gradient descent and reached state-of-the-art performance on the task of predicting the next character on the Wikipedia text corpus.
In [26] stacked bidirectional LSTM networks were employed and achieved the best recorded score on the phoneme recognition benchmark dataset.
Besides stacking RNNs, other deep RNN architectures were proposed in [42]. RNNs can be made deep by adding layers between the hidden layer of the current time step and the hidden layer of the previous time step, by adding layers between the input layer and the hidden layer, or by adding layers between the hidden and the output layer. The proposed architectures are visualized in Figure 4. It is possible to introduce shortcut connections from the input to all hidden layers and/or from all hidden layers to the output to make the training easier and to mitigate the vanishing gradient problem. The authors trained 2-3 layer networks for two different tasks: polyphonic music prediction and language modeling. In both cases the deep architecture outperformed the shallow one. The experiments suggested that each of the proposed RNN architectures has different characteristics and is suited to certain datasets, so there is no clear winner among them. The authors also reported that the deep RNNs were difficult to train and that this may become even more problematic as depth increases. However, the experiments suggest that deep RNNs can benefit from strategies used for training deep feedforward networks, such as ReLU activations and dropout.
Figure 4: Different architectures of deep RNN proposed in
[42]. (a) Standard RNN. (b) Deep Transition (DT) RNN.
(b*) DT-RNN with shortcut connections (c) Deep Transition,
Deep Output (DOT) RNN. (d) Stacked RNN. Image from
[42].
By training an RNN to predict the next element of a sequence, the network can learn a probability distribution over sequences. In [22] an RNN was used to generate new sequences containing long-range structure. The authors trained a deep RNN, obtained by stacking LSTM layers, for the tasks of text prediction and online handwriting prediction.
The RNN encoder-decoder, which consists of two RNNs, was proposed in [9] and [49]. The encoder maps a variable-length input sequence to a fixed-length representation vector, while the decoder maps the representation to the target sequence. The encoder and decoder are jointly trained to maximize the conditional probability of a target sequence given a source sequence. This approach was applied to the machine translation task of translating from English to French. Interestingly, in [49] the authors proposed reversing the order of the input sequence, but not of the target sequence. They do not have a complete explanation of why this works better, but they believe it is caused by the fact that the first few words of the source language become very close to the first few words of the target language, which makes it easier to find the relationship between the source sentence and the target sentence.
Although the tasks of generating new sequences and machine translation are different, what they have in common is that both approaches produce a fixed-length summary of an input sequence in the last hidden layer. Moreover, in [9] the authors showed that the proposed model learns meaningful representations.
4. APPLICATION IN BIOINFORMATICS
With the development of high-throughput DNA and protein sequencing technologies, the number of sequenced genomes has rapidly increased. This explosive growth of sequence data provides an opportunity to improve performance on bioinformatics tasks, such as modeling protein structure and function annotation. As an illustration, fewer than one in a thousand sequenced proteins has an assigned structure [39], and predicting protein structure from sequence remains one of the major challenges in the field. Instead of hand-crafting features, this wealth of sequence data could be used to construct meaningful representations that could then be used to describe complex biological processes. A favorable representation is one that takes into account local as well as long-range interactions.
Despite the great success of deep learning algorithms in other domains, they have been applied in only a modest number of works in bioinformatics. In [13] and [16] deep architectures were applied to the task of protein contact map prediction.
The proposed approach in [13] consisted of three steps. First, a BRNN was used for the prediction of contacts between secondary structure elements, so-called coarse contacts. In the second step, an energy-based model was used to predict the energy between residues by aligning the sequences. Finally, a deep architecture was employed to predict residue-residue contacts, using as features the outputs of the previous steps as well as residue-residue features such as secondary structure and evolutionary information. These were the spatial features of the model. In order to obtain the temporal features, feedforward networks were stacked so that each layer in the stack predicted a contact map, which was then used as input to the subsequent layers. The goal of using the temporal input features was the refinement of the predictions in each new layer. The authors reported accuracy close to 30% for ab initio long-range contact prediction, beating the previous best predictors. In [16] the authors used boosted ensembles of deep network classifiers to predict contact maps.
Features included sequence-specific values for the residues in two windows centered around the residue-residue contact pair, global features, and values characterizing the sequence between the contact pair. The network was initialized using unsupervised pretraining. A deep convolutional generative stochastic network was introduced in [54]. The network was trained for the 8-state protein secondary structure prediction problem, outperforming the previous methods. In [8] a deep autoencoder outperformed previous methods for the prediction of gene ontology annotation, which can be viewed as a matrix completion problem. Furthermore, a deep learning model that
can predict splicing patterns in tissues inferred from mouse
RNA-Seq data was proposed in [37]. Inputs into the model
consisted of tissue types and genomic features describing an
exon, neighboring introns and adjacent exons. The network
had three hidden layers, but the first hidden layer was used
only for genomic features and trained as an autoencoder.
The tasks were the prediction of the discretized percentage of transcripts with an exon spliced in (PSI) and the prediction of the discretized PSI difference between two tissues.
The authors reported that the use of dropout improved the
model performance. Compared to the previous best method,
the developed model was comparable or significantly better
for all tissue types. Recently, in [55] the authors developed DeepSEA, a deep learning framework for the prediction of the epigenetic state of a sequence. Each training example was a 1000-bp nucleotide sequence encoded as a 1000 × 4 binary matrix whose columns correspond to the four nucleotides, while the targets were binary-encoded chromatin features. The authors trained a deep convolutional neural network and showed that it predicts chromatin features with high accuracy.
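As an illustration of this kind of input representation, the following sketch one-hot encodes a nucleotide sequence into an (L, 4) binary matrix; the column order A, C, G, T and the treatment of unknown bases as all-zero rows are assumptions of this example, not details taken from [55].

```python
import numpy as np

def one_hot_dna(seq):
    """Encode a nucleotide sequence as an (L, 4) binary matrix."""
    alphabet = "ACGT"
    mat = np.zeros((len(seq), 4), dtype=np.int8)
    for i, base in enumerate(seq.upper()):
        if base in alphabet:               # unknown bases (e.g. N) stay all-zero
            mat[i, alphabet.index(base)] = 1
    return mat

print(one_hot_dna("ACGTN"))
```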
On the other hand, shallow and recurrent neural networks have a long history of successful applications, especially for protein structure prediction tasks. Standard feedforward neural networks typically use a sliding window approach centered around the target residue, as in [15] for secondary structure prediction and in [43] for contact map prediction. Another often used approach is to extract hand-designed features from the sequence and then apply a learning algorithm to the fixed-dimensional feature vectors, as done in [14] for protein fold recognition. Bidirectional recurrent neural networks have also been used to solve different problems involving biological sequences, as in [2] for secondary structure prediction and in [50] for contact map prediction.
5. CONCLUSION
Deep neural networks have shown great success in a variety of domains, such as object recognition, speech recognition and natural language processing, but their application to problems in bioinformatics lags far behind. Prediction of complex biological processes using hand-designed features of sequential data has led to weak performance on a number of tasks. The abundant amounts of sequence data that are nowadays available could be exploited by deep learning algorithms to construct meaningful features. This provides the opportunity to interpret complex biological processes that are currently beyond our understanding.
REFERENCES
[1] G. Alain and Y. Bengio. What regularized auto-encoders learn from the
data generating distribution. In ICLR’2013, 2013.
[2] P. Baldi, S. Brunak, P. Frasconi, G. Pollastri, and G. Soda. Bidirectional
Dynamics for Protein Secondary Structure Prediction. Lecture Notes in
Computer Science, 1828:80–104, 2001.
[3] Y. Bengio. Learning Deep Architectures for AI. Foundations and
Trends® in Machine Learning, 2(1):1–127, 2009.
[4] Y. Bengio. Deep Learning of Representations for Unsupervised and
Transfer Learning. JMLR: Workshop and Conference Proceedings 7,
7:1–20, 2011.
[5] Y. Bengio, A. Courville, and P. Vincent. Representation Learning:
A Review and New Perspectives. IEEE Trans. Pattern Analysis and
Machine Intelligence (PAMI), 35(8):1798–1828, 2013.
[6] Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin. A Neural Probabilistic Language Model. The Journal of Machine Learning Research,
3:1137–1155, 2003.
[7] Y. Bengio, P. Simard, and P. Frasconi. Learning Long Term Dependencies with Gradient Descent is Difficult, 1994.
[8] D. Chicco, P. Sadowski, and P. Baldi. Deep autoencoder neural networks
for gene ontology annotation predictions. Proceedings of the 5th
ACM Conference on Bioinformatics, Computational Biology, and Health
Informatics - BCB ’14, pages 533–540, 2014.
[9] K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares,
H. Schwenk, and Y. Bengio. Learning Phrase Representations using
RNN Encoder-Decoder for Statistical Machine Translation. Proceedings
of the 2014 Conference on Empirical Methods in Natural Language
Processing (EMNLP), pages 1724–1734, 2014.
[10] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio. Empirical Evaluation of
Gated Recurrent Neural Networks on Sequence Modeling. arXiv, pages
1–9, 2014.
[11] R. Collobert and J. Weston. A unified architecture for natural language
processing. Proceedings of the 25th international conference on Machine
learning, 20:160–167, 2008.
[12] G. Cybenko. Degree of approximation by superpositions of a sigmoidal
function. Mathematics of Control, Signals, and Systems, 2(4):303–314,
1989.
[13] P. Di Lena, K. Nagata, and P. Baldi. Deep architectures for protein
contact map prediction. Bioinformatics, 28(19):2449–2457, 2012.
[14] C. H.Q. Ding and I. Dubchak. Multi-class protein fold recognition
using support vector machines and neural networks. Bioinformatics,
17(4):349–358, 2001.
[15] O. Dor and Y. Zhou. Achieving 80% Ten-fold Cross-validated Accuracy
for Secondary Structure Prediction by Large-scale Training. Proteins,
66:838–845, 2007.
[16] J. Eickholt and J. Cheng. Predicting protein residue-residue contacts
using deep networks and boosting. Bioinformatics, 28(23):3066–3072,
2012.
[17] F. A. Gers and J. Schmidhuber. Recurrent nets that time and count.
Proceedings of the IEEE-INNS-ENNS International Joint Conference
on Neural Networks. IJCNN 2000. Neural Computing: New Challenges
and Perspectives for the New Millennium, 1:189–194 vol.3, 2000.
[18] F. A. Gers, J. Schmidhuber, and F. Cummins. Learning to forget:
continual prediction with LSTM. Neural computation, 12:2451–2471,
2000.
[19] X. Glorot and Y. Bengio. Understanding the difficulty of training deep
feedforward neural networks. Proceedings of the 13th International
Conference on Artificial Intelligence and Statistics (AISTATS), 9:249–
256, 2010.
[20] I. Goodfellow, Y. Bengio, and A. Courville. Deep learning. Book in
preparation for MIT Press, 2016.
[21] A. Graves. Supervised Sequence Labelling with Recurrent Neural Networks. PhD thesis, Technische Universität München, 2008.
[22] A. Graves. Generating sequences with recurrent neural networks. arXiv
preprint arXiv:1308.0850, pages 1–43, 2013.
[23] A. Graves, S. Fernandez, F. Gomez, and J. Schmidhuber. Connectionist
Temporal Classification : Labelling Unsegmented Sequence Data with
Recurrent Neural Networks. Proceedings of the 23rd international
conference on Machine Learning, pages 369–376, 2006.
[24] A. Graves, S. Fernández, and M. Liwicki. Unconstrained online
handwriting recognition with recurrent neural networks. Advances in
Neural Information Processing Systems, 20:1–8, 2008.
[25] A. Graves, S. Fernández, and J. Schmidhuber. Bidirectional LSTM
Networks for Improved Phoneme Classification and Recognition. The
15th international conference on Artificial neural networks: formal
models and their applications - Volume Part II, pages 799–804, 2005.
[26] A. Graves, A. Mohamed, and G. E. Hinton. Speech Recognition With
Deep Recurrent Neural Networks. ICASSP, (3):6645–6649, 2013.
[27] A. Graves and J. Schmidhuber. Framewise phoneme classification with
bidirectional LSTM networks. Proceedings of the International Joint
Conference on Neural Networks, 4:2047–2052, 2005.
[28] K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. arXiv preprint, 2015.
[29] M. Hermans and B. Schrauwen. Training and Analyzing Deep Recurrent
Neural Networks. In NIPS 2013, pages 190–198, 2013.
[30] G. E. Hinton, S. Osindero, and Y. W. Teh. A fast learning algorithm for
deep belief nets. Neural computation, 18:1527–54, 2006.
[31] G. E. Hinton and R. R. Salakhutdinov. Reducing the Dimensionality of
Data with Neural Networks. Science (New York, N.Y.), 313(July):504–
507, 2006.
[32] S. Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Master's thesis, Institut für Informatik, Technische Universität München, 1991.
[33] S. Hochreiter, M. Heusel, and K. Obermayer. Fast model-based protein
homology detection without alignment. Bioinformatics, 23(14):1728–
1736, 2007.
[34] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural
computation, 9(8):1735–80, 1997.
[35] S. Ioffe and C. Szegedy. Batch Normalization: Accelerating Deep
Network Training by Reducing Internal Covariate Shift. arXiv, 2015.
[36] R. Jozefowicz, W. Zaremba, and I. Sutskever. An Empirical Exploration
of Recurrent Network Architectures. Proceedings of the 32nd international conference on Machine Learning - ICML ’15, 2015.
[37] M. K. Leung, H.Y. Xiong, L. J. Lee, and B. J. Frey. Deep learning of
the tissue-regulated splicing code. Bioinformatics, 30:121–129, 2014.
[38] Z. Lipton, J. Berkowitz, and C. Elkan. A Critical Review of Recurrent
Neural Networks for Sequence Learning. arXiv preprint, pages 1–35,
2015.
[39] J. Moult, K. Fidelis, A. Kryshtafovych, T. Schwede, and A. Tramontano.
Critical assessment of methods of protein structure prediction (CASP)–
round x. Proteins, 82 Suppl 2(0 2):1–6, 2014.
[40] V. Nair and G. E. Hinton. Rectified Linear Units Improve Restricted
Boltzmann Machines. Proceedings of the 27th International Conference
on Machine Learning, (3):807–814, 2010.
[41] A. Ng. Sparse autoencoder. CS294A Lecture notes, pages 1–19, 2011.
[42] R. Pascanu, C. Gulcehre, K. Cho, and Y. Bengio. How to Construct
Deep Recurrent Neural Networks. In ICLR’2014, 2014.
[43] M. Punta and B. Rost. PROFcon: Novel prediction of long-range
contacts. Bioinformatics, 21(13):2960–2968, 2005.
[44] M. Ranzato, C. Poultney, S. Chopra, and Y. LeCun. Efficient Learning
of Sparse Representations with an Energy-Based Model. Advances in
Neural Information Processing Systems, pages 1137–1144, 2007.
[45] S. Rifai and X. Muller. Contractive Auto-Encoders : Explicit Invariance
During Feature Extraction. Proceedings of the 28th International
Conference on Machine Learning, 85(1):833–840, 2011.
[46] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning Internal
Representations by Error Propagation, 1986.
[47] M. Schuster and K. K. Paliwal. Bidirectional recurrent neural networks.
IEEE Transactions on Signal Processing, 45(11):2673–2681, 1997.
[48] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and
R. Salakhutdinov. Dropout : A Simple Way to Prevent Neural Networks
from Overfitting. Journal of Machine Learning Research (JMLR),
15:1929–1958, 2014.
[49] I. Sutskever, O. Vinyals, and Q. Le. Sequence to sequence learning with
neural networks. Advances in Neural Information Processing Systems
(NIPS), pages 3104–3112, 2014.
[50] A. N. Tegge, Z. Wang, J. Eickholt, and J. Cheng. NNcon: Improved
protein contact map prediction using 2D-recursive neural networks.
Nucleic Acids Research, 37(May):515–518, 2009.
[51] P. Vincent, H. Larochelle, Y. Bengio, and P. A. Manzagol. Extracting and
composing robust features with denoising autoencoders. Proceedings of
the 25th international conference on Machine learning, pages 1096–
1103, 2008.
[52] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P. A. Manzagol.
Stacked Denoising Autoencoders: Learning Useful Representations in a
Deep Network with a Local Denoising Criterion. Journal of Machine
Learning Research, 11:3371–3408, 2010.
[53] P. J. Werbos. Backpropagation Through Time: What It Does and How
to Do It. Proceedings of the IEEE, 78(October):1550–1560, 1990.
[54] J. Zhou and O. G. Troyanskaya. Deep Supervised and Convolutional
Generative Stochastic Network for Protein Secondary Structure Prediction. Proceedings of the 31st International Conference on Machine
Learning, 32:745–753, 2014.
[55] J. Zhou and O. G. Troyanskaya. Predicting effects of noncoding
variants with deep learning-based sequence model. Nature methods,
12(August):931–4, 2015.