Representation learning of sequential data with application in bioinformatics

Maria Brbić, Division of Electronics, Ruđer Bošković Institute, Zagreb
E-mail: [email protected]

Abstract: This paper presents a brief overview of approaches for handling sequential data and of representation learning based on neural networks. Recurrent neural networks are described as an approach to sequence learning. We discuss the problems that arise when training them and a possible architectural redesign, long short-term memory networks. A variety of approaches to representation learning have been proposed in the literature. This paper summarizes deep learning ideas and techniques, with an emphasis on autoencoders. The final part of the paper summarizes state-of-the-art applications of deep learning in bioinformatics.

Index Terms: Neural networks, recurrent neural networks, long short-term memory networks, deep learning, representation learning, autoencoders

1. INTRODUCTION

The choice of input representation has a strong influence on the performance of machine learning algorithms. Considerable effort usually has to be invested in designing a set of features that enables efficient learning. This requires prior human knowledge about the specific task being solved, which can be scarce, and the design process itself is time-consuming. Instead of manually extracting features, it would be desirable for the learning algorithm to discover a good representation of the raw input without human engagement and to adjust easily to different tasks. This is the idea of representation learning. Representation learning provides an opportunity to exploit the large amounts of unlabeled data available in many domains and to use them to learn a representation that captures the structure of the input distribution. Deep learning algorithms provide a solution by constructing features using many layers of non-linear transformations.
The goal is to form hierarchical representations, in the sense that the features at the higher levels are composed of the features from the lower levels. The hierarchy of features allows the algorithm to form different levels of abstraction. This is motivated by the way people perceive the world: describing an abstract concept by using less abstract ones. Although the benefits of deep architectures had been known for decades, the challenge was how to train deep networks. Efficient parallelization on GPUs and advances in training techniques revolutionised the field, allowing deep networks to outperform state-of-the-art methods in many important applications, such as object recognition [28], speech recognition [26] and natural language processing [11].

The paper is organized as follows. Section 2 briefly reviews the problem of sequence modeling and describes a possible solution provided by recurrent neural networks. Section 3 discusses representation learning with an emphasis on deep learning and autoencoders. Section 4 provides an overview of deep learning applications in bioinformatics. Section 5 concludes the paper.

2. SEQUENCE MODELING

Standard machine learning algorithms are based on the assumption that data points are independent. However, when dealing with sequential data where data points are strongly correlated, this assumption leads to serious limitations. Another issue with standard approaches is that they accept only a fixed number of inputs, whereas sequential data can have inputs and outputs of varying length. Sequence modeling tasks arise in various domains such as time series prediction, speech processing, handwriting recognition and natural language processing, as well as in bioinformatics when dealing with protein or DNA sequences, notably in protein structure and function prediction. Thus, extending existing models to capture dependencies across data points and to handle data of varying length is of crucial importance.
In sequence labeling, the task is to map an input sequence of arbitrary length $(x^{(1)}, x^{(2)}, \ldots, x^{(T-1)}, x^{(T)})$ to a sequence of labels $(y^{(1)}, y^{(2)}, \ldots, y^{(T-1)}, y^{(T)})$. The training set is defined as a set of pairs of sequences $(x^{(t)}, y^{(t)})$, where the index $t$ denotes the current position in the sequence. Graves [21] defines three classes of sequence labeling task:

• Sequence classification. Each input sequence is assigned to a single class. If the input sequences have a fixed length or can be easily padded, there is usually no need for sequential algorithms and any standard classification algorithm can be applied. However, sequential algorithms may adapt better to distortions in the input data.
• Segment classification. The target sequence consists of multiple labels, but the locations of the input segments to which the labels are assigned are known. This is usually the case in natural language processing and bioinformatics. The main difference from sequence classification is the use of context from either side of the segments, which is a problem for standard algorithms designed to process one input at a time.
• Temporal classification. This is the most general class of sequence labeling, where nothing can be assumed about the label sequences and the algorithm has to decide where in the input sequence the labels should be assigned. The approaches for handling such data include HMM-neural network hybrids [25] and connectionist temporal classification [23]. Temporal classification approaches are out of the scope of this review.

As can be seen from these definitions, sequence classification is a special case of segment classification, while segment classification is a special case of temporal classification. In this paper we use the term sequence labeling to refer to the segment classification task.
A possible approach to sequence modeling is to use a sliding window that includes the predecessors and/or successors of the current input and then to apply any standard classification algorithm. A serious drawback of this approach is that the optimal sliding window length is usually unknown and can be input specific. This approach also suffers from the fact that it can exploit only the local dependencies included in the current window. Long-range interactions cannot be captured.

A. Recurrent Neural Networks

Recurrent neural networks (RNNs) are an extension of neural networks designed for handling sequential data. Feedforward networks map new inputs to outputs while forgetting the previous states of the network. RNNs, on the other hand, allow the previous states to affect the current network state by mapping the history of previous inputs to outputs. This is achieved through recurrent connections from a node to itself. Also, unlike the traditional fully connected feedforward network, which learns separate parameters for each input feature, RNNs share the same parameters across multiple time steps. It is this parameter sharing that enables RNNs to generalize across different positions in a sequence and across different sequence lengths. Different RNN architectures have been proposed in the literature, but in this section we refer to the representative example of an RNN containing a single, self-connected hidden layer. By unfolding the computational graph, as shown in Figure 1, the network can be seen as an acyclic, deep neural network with one layer per time step and weights shared across time steps [38]. Thus, the forward and backward passes can be applied as in a standard feedforward neural network. The forward pass applied to the unfolded graph is the same as in a multilayer perceptron (MLP), except that we need to include the connections from the hidden layer at the previous time step.
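The unfolded forward pass described above can be sketched in a few lines. This is a minimal illustration, not the implementation of any cited work: it assumes a tanh nonlinearity, a zero initial hidden state, and that the layer output is fed back as the previous hidden state. The function names and the toy parameter values are hypothetical.

```python
import math

def matvec(M, v):
    """Multiply a matrix M (list of rows) by a vector v (list of floats)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def rnn_forward(xs, U, W, b, h0=None):
    """Run a single-hidden-layer RNN over a sequence of input vectors.

    The same U, W and b are reused at every time step (parameter sharing),
    so the function accepts sequences of any length.
    """
    h = list(h0) if h0 is not None else [0.0] * len(b)  # assumed zero start
    hidden_states = []
    for x in xs:
        Ux = matvec(U, x)   # contribution of the current input
        Wh = matvec(W, h)   # contribution of the previous hidden state
        h = [math.tanh(u + w + bias) for u, w, bias in zip(Ux, Wh, b)]
        hidden_states.append(h)
    return hidden_states

# Hypothetical toy parameters: 2 input features, 3 hidden units.
U = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2]]
W = [[0.1, 0.0, 0.2], [0.0, 0.1, -0.1], [0.3, -0.2, 0.1]]
b = [0.0, 0.1, -0.1]

states_short = rnn_forward([[1.0, 0.0], [0.0, 1.0]], U, W, b)
states_long = rnn_forward([[1.0, 0.0]] * 5, U, W, b)
```

The two calls use the same parameters on sequences of length 2 and 5, which is the sense in which parameter sharing lets the network handle inputs of varying length.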
The output of the hidden layer in an RNN is calculated as:

$$a^{(t)} = \theta\left(U x^{(t)} + W h^{(t-1)} + b\right), \tag{1}$$

where $\theta$ is the activation function, $U$ and $W$ are the input-to-hidden and hidden-to-hidden weight matrices, and $b$ is the bias vector.

The back-propagation algorithm [46] applied to RNNs is called back-propagation through time (BPTT) and was introduced in [53]. Like back-propagation, BPTT iteratively computes the gradient by applying the chain rule to the unfolded graph. The gradient of the loss function with respect to the activation of the hidden layer is calculated through its influence on the hidden layer at the next time step and, as in standard back-propagation, through its influence on the output units. Since the weights need to be the same at each time step, BPTT accumulates (sums) the weight gradients over the whole sequence.

One potential drawback of standard RNNs is that they access only the past context, while completely ignoring the future context. However, in many applications we want the output to exploit information in both directions. This is more often the case in spatial than in temporal domains. For example, in protein secondary structure prediction there is no need to differentiate between the past and the future context. Bidirectional recurrent neural networks (BRNNs), introduced in [47], extend the standard RNN architecture to address this limitation. Bidirectional RNNs combine an RNN that moves forward in time with an RNN that moves backward in time, thus using information from both the future and the past. This is achieved by presenting each sequence in two directions to two separate recurrent hidden layers connected to the same output layer, as shown in Figure 2. In the forward pass, the sequence is presented in the forward and the backward direction to the two hidden layers, and the output is calculated after both hidden layers have processed the entire input sequence. In the backward pass, all output layer errors are calculated first and then back-propagated to the two hidden layers in opposite directions [21].

Figure 2: Unfolded bidirectional recurrent neural network.
Image from [21].

Figure 1: Unfolded recurrent neural network. Image from [20].

The main issue with RNNs is that they can be very difficult to train. When errors are back-propagated across many time steps, the gradient tends to become very small (usually) or very large due to the repeated multiplication of error terms. This leads to the problem known in the literature as the vanishing gradient [32], [7]. Even if we assume that the parameters are stable, RNNs have difficulty learning long-term dependencies because exponentially smaller weights are given to long-term interactions [20]. Although in theory RNNs are able to learn long-range dependencies, in practice the range of context that can be accessed is limited. As the range of context increases, gradient descent becomes increasingly inefficient [7]. Instead of modifying the learning algorithm, it is possible to modify the RNN architecture so that it can store information for longer periods of time. This is the idea of long short-term memory networks.

B. Long Short-Term Memory Networks

Long short-term memory (LSTM) networks, introduced in [34], are a redesign of RNNs that overcomes the vanishing gradient problem by enforcing constant error flow. In an LSTM network, the units in the hidden layers are replaced by memory blocks. Each memory block contains one or more memory cells and a pair of multiplicative gate units: an input gate and an output gate. The memory cell is the central unit of the memory block and stores a value over time. It is built around a self-connected linear unit that has a fixed weight of 1. This fixed self-connection is called the constant error carrousel (CEC) and ensures that, in the absence of an outside signal, the cell state can remain constant through time. Multiplicative gates enable memory cells to store and access information over long time periods. The input gates allow or block the forward-flowing activation from entering the memory cell, while the output gates allow or block the state of the memory cell from influencing other neurons.

An extension of this architecture was proposed in [18]. The authors discovered that the internal values of the cells could grow without bound and proposed extending LSTM networks by replacing the CEC connections with multiplicative forget gates. With forget gates, the network may learn when to forget and when to remember its previous cell state. This extension is now a standard part of the LSTM architecture. Moreover, in [17] the authors proposed a further extension that adds direct connections from the internal state to the input, output and forget gates, named peephole connections. An LSTM memory block with the described extensions is shown in Figure 3. The gate activation function is usually the logistic sigmoid (denoted by f in the figure), so that the gate activations are between 0 and 1. The cell input and output activation functions (denoted by g and h) are usually the hyperbolic tangent or the sigmoid, although sometimes h may be the identity function [21]. The activation of an LSTM hidden layer at time t is calculated with the following equations (without peephole connections):

$$
\begin{aligned}
g^{(t)} &= \tanh\left(W_{xg} x^{(t)} + W_{hg} h^{(t-1)} + b_g\right) \\
i^{(t)} &= \sigma\left(W_{xi} x^{(t)} + W_{hi} h^{(t-1)} + b_i\right) \\
f^{(t)} &= \sigma\left(W_{xf} x^{(t)} + W_{hf} h^{(t-1)} + b_f\right) \\
o^{(t)} &= \sigma\left(W_{xo} x^{(t)} + W_{ho} h^{(t-1)} + b_o\right) \\
c^{(t)} &= i^{(t)} \circ g^{(t)} + f^{(t)} \circ c^{(t-1)} \\
h^{(t)} &= o^{(t)} \circ \tanh\left(c^{(t)}\right),
\end{aligned}
\tag{2}
$$

where $i$, $f$ and $o$ denote the input, forget and output gate respectively, $g$ denotes the activation of the new input value, $c$ denotes the cell state and $h$ denotes the output of the hidden state. The symbol $\circ$ denotes the Hadamard product.

Figure 3: Memory block with one memory cell. Peephole connections are shown with dashed lines. All other connections have a fixed weight of 1. Small black circles denote multiplication. Image from [21].

LSTM networks can be trained by using back-propagation through time. An LSTM learns what to store in the memory and when to access it.
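A single time step of the LSTM update in equation (2) can be sketched as follows. This is a minimal, list-based illustration without peephole connections; the `params` layout, the helper names and the toy values are assumptions made for the example. The toy cell saturates the input gate near 0 and the forget gate near 1 through its biases, so the previous cell state should be carried over almost unchanged.

```python
import math

def sigmoid(a):
    """Logistic sigmoid, used for the gate activations."""
    return 1.0 / (1.0 + math.exp(-a))

def affine(Wx, Wh, b, x, h):
    """Return Wx x + Wh h + b for list-of-rows matrices and list vectors."""
    return [
        sum(w * v for w, v in zip(wx_row, x))
        + sum(w * v for w, v in zip(wh_row, h))
        + bias
        for wx_row, wh_row, bias in zip(Wx, Wh, b)
    ]

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step following equation (2), without peepholes.

    params maps each of "g", "i", "f", "o" to a tuple (Wx, Wh, b);
    this layout is a convention chosen for this sketch.
    """
    g = [math.tanh(a) for a in affine(*params["g"], x, h_prev)]  # new input value
    i = [sigmoid(a) for a in affine(*params["i"], x, h_prev)]    # input gate
    f = [sigmoid(a) for a in affine(*params["f"], x, h_prev)]    # forget gate
    o = [sigmoid(a) for a in affine(*params["o"], x, h_prev)]    # output gate
    c = [ik * gk + fk * ck for ik, gk, fk, ck in zip(i, g, f, c_prev)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c

# Hypothetical cell with 1 input and 1 hidden unit.
zero = [[0.0]]
params = {
    "g": ([[1.0]], zero, [0.0]),
    "i": (zero, zero, [-20.0]),  # input gate saturated near 0 (closed)
    "f": (zero, zero, [20.0]),   # forget gate saturated near 1 (remember)
    "o": (zero, zero, [0.0]),    # output gate at exactly 0.5
}
h, c = lstm_step([1.0], [0.0], [0.7], params)
```

With these values the new input is blocked and `c` stays close to the previous cell state 0.7, while `h` exposes half of `tanh(c)` to the rest of the network.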
When the input gate is closed (takes a value around zero), new activation cannot enter the cell and change the cell state. When the output gate is closed, the activation cannot leave the cell and affect the rest of the network. This enables the network to remember values for a long time and thus capture long-term dependencies. Bidirectional LSTM (BLSTM) networks [27] are able to model long-range structure in two directions. BLSTM networks combine the LSTM architecture with bidirectional RNNs.

LSTM networks have solved complex artificial tasks that had not been solved by other RNN architectures [34]. They have also been successfully applied in many domains requiring long-range memory, including speech recognition [23], handwriting recognition [24] and bioinformatics [33]. In [36] the authors measured the importance of the LSTM gates and reported that the output gate is not important, the input gate is important, and the forget gate is extremely important on all problems except language modelling. They also emphasized that it is very important to initialize the forget gate properly, because otherwise the LSTM may not be capable of solving problems that include long-range interactions.

C. Gated recurrent unit

The gated recurrent unit (GRU) was recently proposed in [9]. It is motivated by the LSTM unit, but is simpler to compute. The GRU has two gates: an update gate and a reset gate. The GRU is defined by the following equations:

$$
\begin{aligned}
r^{(t)} &= \sigma\left(W_{xr} x^{(t)} + W_{hr} h^{(t-1)} + b_r\right) \\
z^{(t)} &= \sigma\left(W_{xz} x^{(t)} + W_{hz} h^{(t-1)} + b_z\right) \\
\tilde{h}^{(t)} &= \tanh\left(W_{xh} x^{(t)} + W_{hh} \left(r^{(t)} \circ h^{(t-1)}\right) + b_h\right) \\
h^{(t)} &= z^{(t)} \circ h^{(t-1)} + \left(1 - z^{(t)}\right) \circ \tilde{h}^{(t)},
\end{aligned}
\tag{3}
$$

where $r$ denotes the reset gate, $z$ denotes the update gate and $\tilde{h}$ is the candidate hidden state.

The GRU and LSTM architectures were compared in [10] and [36]. In [10] the authors compared the architectures on polyphonic music modeling and speech modeling and reported that the GRU outperformed the LSTM on the majority of datasets.
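For reference, one GRU time step as defined in equation (3) can be sketched as below. The code is an illustration only: the `params` layout, the helper names and the toy values are hypothetical, and the two calls simply push the update gate towards 1 and towards 0 to show its two extremes.

```python
import math

def sigmoid(a):
    """Logistic sigmoid, used for the reset and update gates."""
    return 1.0 / (1.0 + math.exp(-a))

def affine(Wx, Wh, b, x, h):
    """Return Wx x + Wh h + b for list-of-rows matrices and list vectors."""
    return [
        sum(w * v for w, v in zip(wx_row, x))
        + sum(w * v for w, v in zip(wh_row, h))
        + bias
        for wx_row, wh_row, bias in zip(Wx, Wh, b)
    ]

def gru_step(x, h_prev, params):
    """One GRU time step following equation (3).

    params maps each of "r", "z", "h" to a tuple (Wx, Wh, b);
    this layout is a convention chosen for this sketch.
    """
    r = [sigmoid(a) for a in affine(*params["r"], x, h_prev)]  # reset gate
    z = [sigmoid(a) for a in affine(*params["z"], x, h_prev)]  # update gate
    # The reset gate scales the previous state before the candidate is computed.
    reset_h = [rk * hk for rk, hk in zip(r, h_prev)]
    h_tilde = [math.tanh(a) for a in affine(*params["h"], x, reset_h)]
    # Convex combination of the previous state and the candidate state.
    return [zk * hk + (1.0 - zk) * ck for zk, hk, ck in zip(z, h_prev, h_tilde)]

# Hypothetical unit with 1 input and 1 hidden unit.
zero = [[0.0]]
base = {"r": (zero, zero, [0.0]), "h": ([[1.0]], zero, [0.0])}
keep_params = dict(base, z=(zero, zero, [20.0]))      # z near 1: keep old state
replace_params = dict(base, z=(zero, zero, [-20.0]))  # z near 0: take candidate

h_keep = gru_step([0.5], [0.3], keep_params)
h_replace = gru_step([0.5], [0.3], replace_params)
```

In the first call the new state stays close to the previous state 0.3; in the second it moves to the candidate `tanh(0.5)`, which is how a single update gate plays the role that the input and forget gates share in the LSTM.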
In [36] the architectures were compared on three different tasks and the GRU outperformed the LSTM on all tasks with the exception of the language modelling tasks. However, once the bias of the LSTM forget gate was initialized to a large positive value, the LSTM no longer lagged behind the GRU.

3. REPRESENTATION LEARNING

In many domains we have access to a large amount of data, but only a small part of the data is labeled. Representation learning is motivated by the idea of learning a good representation from the unlabeled data and then using it for supervised learning tasks. A good representation is one that captures the unknown factors of variation in the training set distribution, i.e. uncovers the latent factors that explain the observed variations in the data, as discussed in [5] and [20]. When learning a representation from unlabeled data, we hope that the features useful for the unsupervised task may also be useful for supervised learning tasks. This is true when there is a relationship between the input distribution $P(X)$ and the target conditional distribution $P(Y|X)$. Although in general this may not be the case, in many real-world applications some of the factors that shape the input $X$ are predictive of the output $Y$ [4]. In cases where we have enough labeled data, representation learning can be performed in a supervised manner. However, since the supervised signal tries to filter out the information irrelevant to the task at hand, the learned features are then task-specific and usually would not be successful when applied to a different task.

Feedforward neural networks can be interpreted as performing representation learning: every hidden layer provides a representation that makes the classification in the last layer of the network easier. The representation associated with an input is the pattern of activation of the hidden units.
Although the universal approximation theorem [12] states that a standard multilayer feedforward network with a single hidden layer containing a finite number of hidden neurons can approximate any continuous function arbitrarily well, in practice shallow network architectures are limited in their representational power. As argued in [3], when a function can be represented by a deep architecture, it might need a very large architecture to be represented by one that is not sufficiently deep. Deep neural networks tend to learn a hierarchy of features, with the higher-level learned features composed of the lower-level features. The aim is to discover more abstract features at the higher levels of the representation that are invariant to local changes. A good example is object recognition. Low-level features are edge detectors that can be combined to construct different local shapes, while combinations of local shapes are used to construct the objects in the image.

Another important aspect of deep architectures is feature reuse. Features at the lower levels of the hierarchy are available and can be useful to a large group of higher-level features. In deep architectures the number of ways to reuse different features can grow exponentially with depth, and new configurations of these features can be used to describe new concepts and generalize to unseen data. This is the idea of distributed representations [3].

Unsupervised deep learning networks, just like their supervised counterparts, learn a representation as a byproduct of trying to optimize the objective function. The only difference is that in the supervised case the learned representation should indicate the factors of variation important for the supervised task, while in the unsupervised case it should explain the observed variations in the input data.
Deep neural networks require the optimization of highly non-convex functions, and searching the parameter space of such functions is a difficult optimization problem. Greedy layer-wise unsupervised pretraining, introduced in [30], has long been used to initialize deep neural networks. It is based on learning single-layer modules layer by layer in an unsupervised manner. Each layer takes the output of the previous layer and produces a new representation. These layers are then stacked together and used to initialize a deep fully connected network. The basic single-layer learning modules for multilayer architectures are autoencoders and Restricted Boltzmann machines (RBMs). With autoencoders the network can be interpreted as a computation graph, while with RBMs it can be interpreted as a probabilistic graphical model [5]. In this review the emphasis is on the first approach. With the invention of new initialization techniques [19] and training approaches, such as the rectified linear unit (ReLU) [40], dropout [48] and batch normalization [35], unsupervised pretraining is no longer required as a technique for training a fully connected deep network.

One very successful example of representation learning is word embeddings in the Natural Language Processing (NLP) domain [6]. In the original space, every word is represented using a one-hot encoding and therefore every pair of words is at the same distance from each other. In the embedded space, words are represented by vectors that encode similarity between words, so that words that are semantically similar are close to each other. In [11] these word embeddings were learned on a large corpus of unlabeled texts from Wikipedia. Positive examples were original windows from Wikipedia, while the negative examples were the same windows with the middle word replaced by a random word. The learned representation is now used to solve different NLP tasks.

A. Autoencoders

An autoencoder is a neural network that is trained to reconstruct its input in the output layer through an internal representation. Usually, we are not interested in the output of the decoder, but we hope that the internal representation will capture the structure of the input data. The network consists of two parts: the encoder function $f_\theta : X \to F$ and the decoder function $g_\theta : F \to X$, where $X$ denotes the input space and $F$ denotes the feature space. The encoder function $f_\theta$ calculates the representation vector, or code, $h^{(i)}$ of each input vector $x^{(i)}$:

$$h^{(i)} = f_\theta\left(x^{(i)}\right). \tag{4}$$

The decoder function $g_\theta$ produces a reconstruction $\tilde{x}^{(i)}$ of each input vector $x^{(i)}$ given a representation vector $h^{(i)}$:

$$\tilde{x}^{(i)} = g_\theta\left(h^{(i)}\right). \tag{5}$$

The set of encoder and decoder parameters $\theta$ is optimized to minimize the reconstruction error:

$$\sum_i L\left(x^{(i)}, g_\theta\left(f_\theta\left(x^{(i)}\right)\right)\right), \tag{6}$$

where $L$ denotes a loss function. When $f$ and $g$ are linear functions and $L$ is the mean squared error, the autoencoder learns the same subspace as PCA [5]. Encoder and decoder weights are usually tied. This restricts the model's capacity and reduces the number of parameters to optimize. However, if we do not impose any other constraints on the network structure or the training criterion, the network could learn the identity function and just copy the input into the representation with zero reconstruction error everywhere. One way to avoid this is to constrain the representation vector to have a lower dimension than the input vector. This forces the autoencoder to learn a compressed representation of the input.

Instead of limiting the capacity of the model, it is possible to change the objective function to include other properties such as sparsity of the representation, smallness of the derivative and robustness to noise [20]. Sparse autoencoders, introduced in [44], allow a large number of hidden units but constrain them to be inactive most of the time. This forces the autoencoder to learn sparse features.
Representation sparsity is achieved by adding a sparsity penalty as a regularization term, so that the optimization objective becomes:

$$\sum_i L\left(x^{(i)}, g_\theta\left(f_\theta\left(x^{(i)}\right)\right)\right) + \Omega(h), \tag{7}$$

where $\Omega(h)$ denotes the sparsity penalty term. Usually, the hidden unit activations are penalized by using the L1 penalty:

$$\Omega(h) = \lambda \sum_j |h_j|, \tag{8}$$

or the Kullback-Leibler divergence with respect to the Bernoulli distribution [41]:

$$\Omega(h) = \lambda \sum_j \left( \rho \log \frac{\rho}{h_j} + (1 - \rho) \log \frac{1 - \rho}{1 - h_j} \right), \tag{9}$$

where $\lambda$ is a hyperparameter, $\rho$ denotes the sparsity parameter (typically a small value close to zero) and $h_j$ is the average activation of hidden unit $j$ across examples.

The denoising autoencoder [51] is an autoencoder that receives a corrupted version of an input and is trained to reconstruct the original input from the corrupted one. In this way the autoencoder needs to learn not only to reconstruct, but also to denoise the input. In order to map the low-probability corrupted inputs to the clean inputs that have higher probability, the autoencoder is encouraged to learn the structure of the data. The denoising autoencoder works by first corrupting the original input with some corruption process; the corrupted input is then mapped to an internal representation from which the autoencoder needs to produce the original input. As with the basic autoencoder, the optimization objective is to minimize the reconstruction error:

$$\sum_i L\left(x^{(i)}, g_\theta\left(f_\theta\left(\tilde{x}^{(i)}\right)\right)\right), \tag{10}$$

where $\tilde{x}$ denotes the corrupted version of the original input $x$. In [51] the corruption process was masking noise: parts of the input were erased, while other parts were left unchanged. Specifically, for each input $x$, randomly chosen components were forced to zero and the autoencoder was trained to fill in the missing input data. Other types of corruption noise can also be used, such as additive Gaussian noise and salt-and-pepper noise, considered in [52].
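The two sparsity penalties of equations (8) and (9) and the masking-noise corruption can be written down compactly. The sketch below is illustrative only: the function names, the fixed `fraction` argument and the toy values are assumptions, and the encoder, decoder and training loop are deliberately left out.

```python
import math
import random

def l1_penalty(h, lam):
    """Equation (8): L1 penalty on the hidden unit activations h."""
    return lam * sum(abs(h_j) for h_j in h)

def kl_penalty(mean_activations, rho, lam):
    """Equation (9): KL divergence between Bernoulli(rho) and Bernoulli(h_j).

    mean_activations holds the average activation h_j of each hidden unit
    across examples; every value must lie strictly between 0 and 1.
    """
    return lam * sum(
        rho * math.log(rho / h_j)
        + (1.0 - rho) * math.log((1.0 - rho) / (1.0 - h_j))
        for h_j in mean_activations
    )

def masking_noise(x, fraction, rng):
    """Return a copy of x with a randomly chosen fraction of components set to 0."""
    n_masked = int(round(fraction * len(x)))
    masked = set(rng.sample(range(len(x)), n_masked))
    return [0.0 if k in masked else x_k for k, x_k in enumerate(x)]

# Toy values (hypothetical): the penalty vanishes when the average activation
# matches rho and grows as a unit becomes active more often than rho.
at_target = kl_penalty([0.05, 0.05], rho=0.05, lam=1.0)
too_active = kl_penalty([0.5, 0.05], rho=0.05, lam=1.0)

x = [0.1 * (k + 1) for k in range(10)]
corrupted = masking_noise(x, 0.5, random.Random(0))
```

In a denoising setup the corrupted vector would be passed through the encoder and decoder, while the loss in equation (10) is still measured against the clean `x`.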
Contractive autoencoders [45] encourage robustness of the representation to small input changes by penalizing the sensitivity of the features to small input perturbations. The penalty term corresponds to the squared Frobenius norm of the Jacobian matrix of the encoder activations with respect to the input. Low-valued first derivatives make the representation robust to small changes of the input. Minimizing only the Jacobian term would lead to learning a constant representation, but in combination with the minimization of the reconstruction error, contractive autoencoders are forced to learn a representation from which the input can be well reconstructed. The objective function of the contractive autoencoder is given by:

$$\sum_i \left( L\left(x^{(i)}, g_\theta\left(f_\theta\left(x^{(i)}\right)\right)\right) + \lambda \left\| J_{f_\theta}\left(x^{(i)}\right) \right\|_F^2 \right), \tag{11}$$

where $\lambda$ is a hyperparameter, $J_f(x)$ is the Jacobian matrix of the function $f$ with respect to $x$ and $\|\cdot\|_F$ denotes the Frobenius norm. There is a relationship between the contractive and the denoising autoencoders. In particular, the denoising autoencoder with very small Gaussian corruption and a squared error loss function can be seen as a particular kind of contractive autoencoder, as shown in [1].

When training a deep autoencoder, the usual strategy is to train shallow autoencoders layer by layer and then stack them together [20]. This was successfully done in [31] to reduce the dimensionality of the data.

B. Representation learning for sequential data

RNNs can be seen as deep networks when unfolded in time. However, the depth of RNNs addresses the need for memory, not the construction of hierarchical features as in deep feedforward neural networks. At each time step of an RNN, the current input passes through only one hidden layer before it reaches the output. From this point of view, an RNN can be interpreted as having a shallow architecture.
However, just like feedforward neural networks, RNNs can also benefit from depth, and deep RNNs have been shown to outperform shallow RNNs on different tasks. In [29] a deep RNN was constructed by stacking shallow RNNs. Each layer of the network is an RNN and therefore receives as input its own hidden state from the previous time step, but it also receives the hidden state of the previous layer from the same time step. This architecture naturally combines recurrent and deep neural network architectures. The authors trained a 5-layer network with stochastic gradient descent and reached state-of-the-art performance on the task of predicting the next character on the Wikipedia text corpus. In [26] stacked bidirectional LSTM networks were employed and achieved the best recorded score on the phoneme recognition benchmark dataset.

Besides stacking RNNs, other deep RNN architectures were proposed in [42]. RNNs can be made deep by adding layers between the hidden layer at the current time step and the hidden layer at the previous time step, by adding layers between the input layer and the hidden layer, or by adding layers between the hidden and the output layer. The proposed architectures are visualized in Figure 4. It is possible to introduce shortcut connections from the input to all hidden layers and/or from all hidden layers to the output in order to make the training easier and to mitigate the vanishing gradient problem. The authors trained networks with 2 to 3 layers for two different tasks: polyphonic music prediction and language modeling. In both cases the deep architecture outperformed the shallow one. The experiments suggested that each of the proposed RNN architectures has different characteristics and is suitable for some datasets, so there is no clear winner among them. The authors also reported that the deep RNNs were difficult to train and that training may become even more problematic as the depth increases.
However, the experiments suggest that deep RNNs can benefit from strategies used for training deep feedforward networks, such as ReLU activations and dropout.

Figure 4: Different architectures of deep RNN proposed in [42]. (a) Standard RNN. (b) Deep Transition (DT) RNN. (b*) DT-RNN with shortcut connections. (c) Deep Transition, Deep Output (DOT) RNN. (d) Stacked RNN. Image from [42].

By training an RNN to predict the next element of a sequence, the network can learn a probability distribution over sequences. In [22] an RNN was used to generate new sequences containing long-range structure. The authors trained a deep RNN obtained by stacking LSTM layers for the tasks of text prediction and online handwriting prediction.

An RNN Encoder-Decoder that consists of two RNNs was proposed in [9] and [49]. The encoder maps a variable-length input sequence to a fixed-length representation vector, while the decoder maps this representation to the target sequence. The encoder and decoder were jointly trained to maximize the conditional probability of a target sequence given a source sequence. This approach was applied to the machine translation task of translating from English to French. It is interesting to mention that in [49] the authors proposed reversing the order of the input sequence, but not of the target sequence. They do not have a complete explanation of why this works better, but they believe that it is caused by the fact that the first few words in the source language become very close to the first few words in the target language, and therefore it is easier to find the relationship between the source sentence and the target sentence.

Although the tasks of generating new sequences and machine translation differ from each other, what these approaches have in common is that they produce a fixed-length summary of an input sequence in the last hidden layer. Moreover, in [9] the authors showed that the proposed model learns meaningful representations.

4. APPLICATION IN BIOINFORMATICS

With the development of high-throughput DNA and protein sequencing technologies, the number of sequenced genomes has rapidly increased. This explosive growth of sequence data provides an opportunity to improve performance on bioinformatics tasks, such as modeling protein structure and function annotation. As an illustration, less than 1/1000 of the sequenced proteins have an assigned structure [39], and predicting the structure of a protein from its sequence remains one of the major challenges in the field. Instead of hand-crafting features, this wealth of sequence data could be used to construct meaningful representations that could then be used to describe complex biological processes. A favorable representation is one that takes into account local as well as long-range interactions.

Despite the great success of deep learning algorithms in other domains, they have been applied in only a modest number of works in bioinformatics. In [13] and [16] a deep architecture was applied to the task of protein contact map prediction. The approach proposed in [13] consisted of three steps. First, a BRNN was used to predict contacts between secondary structure elements, so-called coarse contacts. In the second step, an energy-based model was used to predict the energy between residues by aligning the sequences. Finally, the deep architecture was employed to predict residue-residue contacts, using as features the outputs of the previous steps as well as residue-residue features such as secondary structure and evolutionary information. These were the spatial features of the model. In order to obtain the temporal features, feedforward networks were stacked so that each layer in the stack predicted a contact map, which was then used as input to the subsequent layers. The goal of using temporal input features was the refinement of the predictions in each new layer.
The authors of [13] reported an accuracy close to 30% for ab initio long-range contact prediction, outperforming the previous best predictors. In [16] the authors used boosted ensembles of deep network classifiers to predict contact maps. The features included sequence-specific values for the residues in two windows centered on the residue–residue contact pair, global features, and values characterizing the sequence between the contact pair. The network was initialized using unsupervised pretraining.

A deep convolutional generative stochastic network was introduced in [54]. The network was trained for the 8-state protein secondary structure prediction problem and outperformed the previous methods. In [8] a deep autoencoder outperformed previous methods for the prediction of gene ontology annotations, a task that can be viewed as a matrix completion problem.

Furthermore, a deep learning model that predicts splicing patterns in tissues, inferred from mouse RNA-Seq data, was proposed in [37]. The inputs to the model consisted of tissue types and genomic features describing an exon, its neighboring introns and the adjacent exons. The network had three hidden layers; the first hidden layer was used only for the genomic features and was trained as an autoencoder. The tasks were the prediction of the discretized percentage of transcripts with an exon spliced in (PSI) and the prediction of the discretized PSI difference between two tissues. The authors reported that the use of dropout improved the model performance. Compared to the previous best method, the developed model was comparable or significantly better for all tissue types.

Recently, the authors of [55] developed DeepSEA, a deep learning framework for the prediction of the epigenetic state of a sequence. Each training example was a 1000-bp nucleotide sequence encoded as a 1000 × 4 binary matrix whose columns corresponded to the four nucleotides, while the targets were binary-encoded chromatin features.
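The binary input encoding described above can be illustrated with a short sketch. It is a minimal illustration rather than the implementation used in [55]: the function name `one_hot_encode`, the column order A, C, G, T, and the all-zero row for unknown symbols are assumptions made here.

```python
# Column order is an assumption of this sketch, not taken from [55].
NUCLEOTIDES = "ACGT"


def one_hot_encode(sequence):
    """Return a len(sequence) x 4 binary matrix as a list of rows."""
    index = {base: i for i, base in enumerate(NUCLEOTIDES)}
    matrix = []
    for base in sequence.upper():
        row = [0] * len(NUCLEOTIDES)
        if base in index:  # unknown symbols such as 'N' stay all-zero (assumption)
            row[index[base]] = 1
        matrix.append(row)
    return matrix


# A short toy sequence; a 1000-bp input would give a 1000 x 4 matrix.
encoded = one_hot_encode("ACGTN")
for row in encoded:
    print(row)
```

Each row has at most one non-zero entry, so the matrix preserves the order of the nucleotides while giving a convolutional network a fixed-width numeric input.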
The authors of [55] trained a deep convolutional neural network and showed that it predicts chromatin features with high accuracy.

On the other hand, shallow and recurrent neural networks have a long history of successful applications, especially for the tasks of protein structure prediction. Standard feedforward neural networks typically use a sliding window centered on the target residue, as in [15] for secondary structure prediction and in [43] for contact map prediction. Another frequently used approach is to extract hand-designed features from the sequence and then apply a learning algorithm to the fixed-dimensional feature vectors, as done in [14] for protein fold recognition. Bidirectional recurrent neural networks have also been used to solve different problems involving biological sequences, such as in [2] for secondary structure prediction and in [50] for contact map prediction.

5. CONCLUSION

Deep neural networks have shown great success in a variety of domains such as object recognition, speech recognition and natural language processing, but their application to problems in bioinformatics lags far behind. Prediction of complex biological processes using hand-designed features of sequential data has led to weak performance on a number of tasks. The abundant sequence data available nowadays could be exploited by deep learning algorithms to construct meaningful features. This provides an opportunity to interpret complex biological processes that are currently beyond our understanding.

REFERENCES

[1] G. Alain and Y. Bengio. What regularized auto-encoders learn from the data generating distribution. In ICLR’2013, 2013.
[2] P. Baldi, S. Brunak, P. Frasconi, G. Pollastri, and G. Soda. Bidirectional Dynamics for Protein Secondary Structure Prediction. Lecture Notes in Computer Science, 1828:80–104, 2001.
[3] Y. Bengio. Learning Deep Architectures for AI. Foundations and Trends® in Machine Learning, 2(1):1–127, 2009.
[4] Y. Bengio. Deep Learning of Representations for Unsupervised and Transfer Learning. JMLR: Workshop and Conference Proceedings 7, 7:1–20, 2011.
[5] Y. Bengio, A. Courville, and P. Vincent. Representation Learning: A Review and New Perspectives. IEEE Trans. Pattern Analysis and Machine Intelligence (PAMI), 35(8):1798–1828, 2013.
[6] Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin. A Neural Probabilistic Language Model. The Journal of Machine Learning Research, 3:1137–1155, 2003.
[7] Y. Bengio, P. Simard, and P. Frasconi. Learning Long Term Dependencies with Gradient Descent is Difficult, 1994.
[8] D. Chicco, P. Sadowski, and P. Baldi. Deep autoencoder neural networks for gene ontology annotation predictions. Proceedings of the 5th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics - BCB ’14, pages 533–540, 2014.
[9] K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, 2014.
[10] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. arXiv, pages 1–9, 2014.
[11] R. Collobert and J. Weston. A unified architecture for natural language processing. Proceedings of the 25th international conference on Machine learning, 20:160–167, 2008.
[12] G. Cybenko. Degree of approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems, 2(4):303–314, 1989.
[13] P. Di Lena, K. Nagata, and P. Baldi. Deep architectures for protein contact map prediction. Bioinformatics, 28(19):2449–2457, 2012.
[14] C. H. Q. Ding and I. Dubchak. Multi-class protein fold recognition using support vector machines and neural networks. Bioinformatics, 17(4):349–358, 2001.
[15] O. Dor and Y. Zhou. Achieving 80% Ten-fold Cross-validated Accuracy for Secondary Structure Prediction by Large-scale Training. Proteins, 66:838–845, 2007.
[16] J. Eickholt and J. Cheng. Predicting protein residue-residue contacts using deep networks and boosting. Bioinformatics, 28(23):3066–3072, 2012.
[17] F. A. Gers and J. Schmidhuber. Recurrent nets that time and count. Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks. IJCNN 2000. Neural Computing: New Challenges and Perspectives for the New Millennium, 1:189–194 vol.3, 2000.
[18] F. A. Gers, J. Schmidhuber, and F. Cummins. Learning to forget: continual prediction with LSTM. Neural computation, 12:2451–2471, 2000.
[19] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS), 9:249–256, 2010.
[20] I. Goodfellow, Y. Bengio, and A. Courville. Deep learning. Book in preparation for MIT Press, 2016.
[21] A. Graves. Supervised Sequence Labelling with Recurrent Neural Networks. Image Rochester NY, page 124, 2008.
[22] A. Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, pages 1–43, 2013.
[23] A. Graves, S. Fernandez, F. Gomez, and J. Schmidhuber. Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks. Proceedings of the 23rd international conference on Machine Learning, pages 369–376, 2006.
[24] A. Graves, S. Fernández, and M. Liwicki. Unconstrained online handwriting recognition with recurrent neural networks. Advances in Neural Information Processing Systems, 20:1–8, 2008.
[25] A. Graves, S. Fernández, and J. Schmidhuber. Bidirectional LSTM Networks for Improved Phoneme Classification and Recognition. The 15th international conference on Artificial neural networks: formal models and their applications - Volume Part II, pages 799–804, 2005.
[26] A. Graves, A. Mohamed, and G. E. Hinton. Speech Recognition With Deep Recurrent Neural Networks. ICASSP, (3):6645–6649, 2013.
[27] A. Graves and J. Schmidhuber. Framewise phoneme classification with bidirectional LSTM networks. Proceedings of the International Joint Conference on Neural Networks, 4:2047–2052, 2005.
[28] K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. Arxiv.Org, 7:171–180, 2015.
[29] M. Hermans and B. Schrauwen. Training and Analyzing Deep Recurrent Neural Networks. In NIPS 2013, pages 190–198, 2013.
[30] G. E. Hinton, S. Osindero, and Y. W. Teh. A fast learning algorithm for deep belief nets. Neural computation, 18:1527–54, 2006.
[31] G. E. Hinton and R. R. Salakhutdinov. Reducing the Dimensionality of Data with Neural Networks. Science (New York, N.Y.), 313(July):504–507, 2006.
[32] S. Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Master’s thesis, Institut für Informatik, Technische Universität, München, 1991.
[33] S. Hochreiter, M. Heusel, and K. Obermayer. Fast model-based protein homology detection without alignment. Bioinformatics, 23(14):1728–1736, 2007.
[34] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–80, 1997.
[35] S. Ioffe and C. Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv, 2015.
[36] R. Jozefowicz, W. Zaremba, and I. Sutskever. An Empirical Exploration of Recurrent Network Architectures. Proceedings of the 32nd international conference on Machine Learning - ICML ’15, 2015.
[37] M. K. Leung, H. Y. Xiong, L. J. Lee, and B. J. Frey. Deep learning of the tissue-regulated splicing code. Bioinformatics, 30:121–129, 2014.
[38] Z. Lipton, J. Berkowitz, and C. Elkan. A Critical Review of Recurrent Neural Networks for Sequence Learning. arXiv preprint, pages 1–35, 2015.
[39] J. Moult, K. Fidelis, A. Kryshtafovych, T. Schwede, and A. Tramontano. Critical assessment of methods of protein structure prediction (CASP) – round x. Proteins, 82 Suppl 2(0 2):1–6, 2014.
[40] V. Nair and G. E. Hinton. Rectified Linear Units Improve Restricted Boltzmann Machines. Proceedings of the 27th International Conference on Machine Learning, (3):807–814, 2010.
[41] A. Ng. Sparse autoencoder. CS294A Lecture notes, pages 1–19, 2011.
[42] R. Pascanu, C. Gulcehre, K. Cho, and Y. Bengio. How to Construct Deep Recurrent Neural Networks. In ICLR’2014, 2014.
[43] M. Punta and B. Rost. PROFcon: Novel prediction of long-range contacts. Bioinformatics, 21(13):2960–2968, 2005.
[44] M. Ranzato, C. Poultney, S. Chopra, and Y. LeCun. Efficient Learning of Sparse Representations with an Energy-Based Model. Advances in Neural Information Processing Systems, pages 1137–1144, 2007.
[45] S. Rifai and X. Muller. Contractive Auto-Encoders: Explicit Invariance During Feature Extraction. Proceedings of the 28th International Conference on Machine Learning, 85(1):833–840, 2011.
[46] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning Internal Representations by Error Propagation, 1986.
[47] M. Schuster and K. K. Paliwal. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673–2681, 1997.
[48] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research (JMLR), 15:1929–1958, 2014.
[49] I. Sutskever, O. Vinyals, and Q. Le. Sequence to sequence learning with neural networks. Advances in Neural Information Processing Systems (NIPS), pages 3104–3112, 2014.
[50] A. N. Tegge, Z. Wang, J. Eickholt, and J. Cheng. NNcon: Improved protein contact map prediction using 2D-recursive neural networks. Nucleic Acids Research, 37(May):515–518, 2009.
[51] P. Vincent, H. Larochelle, Y. Bengio, and P. A. Manzagol. Extracting and composing robust features with denoising autoencoders. Proceedings of the 25th international conference on Machine learning, pages 1096–1103, 2008.
[52] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P. A. Manzagol. Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion. Journal of Machine Learning Research, 11:3371–3408, 2010.
[53] P. J. Werbos. Backpropagation Through Time: What It Does and How to Do It. Proceedings of the IEEE, 78(October):1550–1560, 1990.
[54] J. Zhou and O. G. Troyanskaya. Deep Supervised and Convolutional Generative Stochastic Network for Protein Secondary Structure Prediction. Proceedings of the 31st International Conference on Machine Learning, 32:745–753, 2014.
[55] J. Zhou and O. G. Troyanskaya. Predicting effects of noncoding variants with deep learning-based sequence model. Nature methods, 12(August):931–4, 2015.
