Linköping Studies in Science and Technology. Dissertations No. 531

Learning Multidimensional Signal Processing

Magnus Borga

Department of Electrical Engineering
Linköping University, S-581 83 Linköping, Sweden

Linköping 1998

© 1998 Magnus Borga
ISBN 91-7219-202-X
ISSN 0345-7524

Abstract

The subject of this dissertation is to show how learning can be used for multidimensional signal processing, in particular computer vision. Learning is a wide concept, but it can generally be defined as a system's change of behaviour in order to improve its performance in some sense. Learning systems can be divided into three classes: supervised learning, reinforcement learning and unsupervised learning. Supervised learning requires a set of training data with correct answers and can be seen as a kind of function approximation. A reinforcement learning system does not require a set of answers; it learns by maximizing a scalar feedback signal indicating the system's performance. Unsupervised learning can be seen as a way of finding a good representation of the input signals according to a given criterion.

In learning and signal processing, the choice of signal representation is a central issue. For high-dimensional signals, dimensionality reduction is often necessary. It is then important not to discard useful information. For this reason, learning methods based on maximizing mutual information are particularly interesting. A properly chosen data representation allows local linear models to be used in learning systems. Such models have the advantage of having a small number of parameters and can for this reason be estimated using relatively few samples. An interesting method that can be used to estimate local linear models is canonical correlation analysis (CCA). CCA is strongly related to mutual information.
The relation between CCA and three other linear methods is discussed. These methods are principal component analysis (PCA), partial least squares (PLS) and multivariate linear regression (MLR). An iterative method for CCA, PCA, PLS and MLR, in particular for low-rank versions of these methods, is presented. A novel method for learning filters for multidimensional signal processing using CCA is presented. By presenting signals to the system in pairs, the filters can be adapted to detect certain features and to be invariant to others. A new method for local orientation estimation has been developed using this principle. This method is significantly less sensitive to noise than previously used methods. Finally, a novel stereo algorithm is presented. This algorithm uses CCA and phase analysis to detect the disparity in stereo images. The algorithm adapts filters in each local neighbourhood of the image in a way which maximizes the correlation between the filtered images. The adapted filters are then analysed to find the disparity. This is done by a simple phase analysis of the scalar product of the filters. The algorithm can even handle cases where the images have different scales. The algorithm can also handle depth discontinuities and give multiple depth estimates for semi-transparent images.

To Maria

Acknowledgements

This thesis is the result of many years of work, and it would never have been possible for me to accomplish it without the help, support and encouragement from a lot of people. First of all, I would like to thank my supervisor, associate professor Hans Knutsson. His enthusiastic engagement in my research and his never-ending stream of ideas have been absolutely essential for the results presented here. I am very grateful that he has spent so much time with me discussing different problems ranging from philosophical issues down to minute technical details.
I would also like to thank professor Gösta Granlund for giving me the opportunity to work in his research group and for managing a laboratory it is a pleasure to work in. Many thanks to present and past members of the Computer Vision Laboratory for being good friends as well as helpful colleagues. In particular, I would like to thank Dr. Tomas Landelius, with whom I have worked very closely in most of the research presented here, as well as in the (not yet finished) systematic search for the optimum malt whisky. His comments on large parts of the early versions of the manuscript have been very valuable. I would also like to thank Morgan Ulvklo and Dr. Mats Andersson for constructive comments on parts of the manuscript. Dr. Mats Andersson's help with a lot of technical details, ranging from the design of quadrature filters to welding, is also very much appreciated.

Finally, I would like to thank my wife Maria for her love, support and patience. Maria should also have great credit for proof-reading my manuscript and helping me with the English. All remaining errors, due to final changes, are to be blamed on me.

The research presented in this thesis was sponsored by NUTEK (Swedish National Board for Industrial and Technical Development) and TFR (Swedish Research Council for Engineering Sciences), which is gratefully acknowledged.

Contents

1 Introduction  1
  1.1 Contributions  2
  1.2 Outline  3
  1.3 Notation  4

I Learning  5

2 Learning systems  7
  2.1 Learning  7
  2.2 Machine learning  8
  2.3 Supervised learning  9
    2.3.1 Gradient search  10
    2.3.2 Adaptability  11
  2.4 Reinforcement learning  12
    2.4.1 Searching for higher rewards  14
    2.4.2 Generating the reinforcement signal  20
    2.4.3 Learning in an evolutionary perspective  22
  2.5 Unsupervised learning  23
    2.5.1 Hebbian learning  24
    2.5.2 Competitive learning  26
    2.5.3 Mutual information based learning  28
  2.6 Comparisons between the three learning methods  32
  2.7 Two important problems  33
    2.7.1 Perceptual aliasing  33
    2.7.2 Credit assignment  35

3 Information representation  37
  3.1 The channel representation  39
  3.2 Neural networks  44
  3.3 Linear models  46
    3.3.1 The prediction matrix memory  46
  3.4 Local linear models  51
  3.5 Adaptive model distribution  52
  3.6 Experiments  53
    3.6.1 Q-learning with the prediction matrix memory  54
    3.6.2 TD-learning with local linear models  54
    3.6.3 Discussion  57

4 Low-dimensional linear models  59
  4.1 The generalized eigenproblem  61
  4.2 Principal component analysis  64
  4.3 Partial least squares  66
  4.4 Canonical correlation analysis  67
    4.4.1 Relation to mutual information and ICA  70
    4.4.2 Relation to SNR  70
  4.5 Multivariate linear regression  73
  4.6 Comparisons between PCA, PLS, CCA and MLR  75
  4.7 Gradient search on the Rayleigh quotient  78
    4.7.1 PCA  82
    4.7.2 PLS  83
    4.7.3 CCA  84
    4.7.4 MLR  85
  4.8 Experiments  87
    4.8.1 Comparisons to optimal solutions  87
    4.8.2 Performance in high-dimensional signal spaces  92

II Applications in computer vision  97

5 Computer vision  99
  5.1 Feature hierarchies  99
  5.2 Phase and quadrature filters  100
  5.3 Orientation  101
  5.4 Frequency  103
  5.5 Disparity  103

6 Learning feature descriptors  107
  6.1 Experiments  110
    6.1.1 Learning quadrature filters  110
    6.1.2 Combining products of filter outputs  115
  6.2 Discussion  119

7 Disparity estimation using CCA  121
  7.1 The canonical correlation analysis part  122
  7.2 The phase analysis part  123
    7.2.1 The signal model  125
    7.2.2 Multiple disparities  127
    7.2.3 Images with different scales  128
  7.3 Experiments  129
    7.3.1 Discontinuities  129
    7.3.2 Scaling  131
    7.3.3 Semi-transparent images  132
    7.3.4 An artificial scene  134
    7.3.5 Real images  134
  7.4 Discussion  138

8 Epilogue  145
  8.1 Summary and discussion  145
  8.2 Future research  147

A Definitions  151
  A.1 The vec function  151
  A.2 The mtx function  151
  A.3 Correlation for complex variables  152

B Proofs  153
  B.1 Proofs for chapter 2  153
    B.1.1 The differential entropy of a multidimensional Gaussian variable  153
  B.2 Proofs for chapter 3  154
    B.2.1 The constant norm of the channel set  154
    B.2.2 The constant norm of the channel derivatives  155
    B.2.3 Derivation of the update rule for the prediction matrix memory  156
    B.2.4 One frequency spans a 2-D plane  156
  B.3 Proofs for chapter 4  157
    B.3.1 Orthogonality in the metrics A and B  157
    B.3.2 Linear independence  158
    B.3.3 The range of r  158
    B.3.4 The second derivative of r  159
    B.3.5 Positive eigenvalues of the Hessian  159
    B.3.6 The partial derivatives of the covariance  160
    B.3.7 The partial derivatives of the correlation  160
    B.3.8 Invariance with respect to linear transformations  161
    B.3.9 Relationship between mutual information and canonical correlation  162
    B.3.10 The partial derivatives of the MLR-quotient  163
    B.3.11 The successive eigenvalues  164
  B.4 Proofs for chapter 7  165
    B.4.1 Real-valued canonical correlations  165
    B.4.2 Hermitian matrices  165

Chapter 1
Introduction

This thesis deals with two research areas: learning and multidimensional signal processing. A typical example of a multidimensional signal is an image. An image is usually described in terms of pixel (picture element) values. A monochrome TV image has a resolution of approximately 700 × 500 pixels, which means that it is a 350,000-dimensional signal. In computer vision, we try to instruct a computer how to extract the relevant information from this huge signal in order to solve a certain task. This is not an easy problem! The information is extracted by estimating certain local features in the image. What is "relevant information" depends, of course, on the task. To describe what features to estimate and how to estimate them is possible only for highly specific tasks, which, for a human, seem to be trivial in most cases. For more general tasks, we can only define these feature detectors on a very low level, such as line and edge detectors. It is commonly accepted that it is difficult to design higher-level feature detectors. In fact, the difficulty arises already when trying to define what features are important to estimate. Nature has solved this problem by making the visual system adaptive.
In other words, we learn how to see. We know that many of the low-level feature detectors used in computer vision are similar to those found in the mammalian visual system (Pollen and Ronner, 1983). Since we generally do not know how to handle multidimensional signals on a high level, and since our solutions on a low level are similar to those of nature, it seems rational to use nature's solution, learning, on a higher level as well.

Learning in artificial systems is often associated with artificial neural networks. Note, however, that the term "neural network" refers to a specific type of architecture. In this work, we are more interested in the learning capabilities than in the hardware implementation. What we mean by "learning systems" is discussed in the next chapter.

The learning process can be seen as a way of finding adaptive models to represent relevant parts of the signal. We believe that local low-dimensional linear models are sufficient and efficient for representation in many systems. The reason for this is that most real-world signals are (at least piecewise) continuous due to the dynamics of the world that generates them. It is therefore justified to look at some criteria for choosing low-dimensional linear models.

In the field of signal processing, there seems to be a growing interest in methods related to independent component analysis. In the learning and neural network community, methods based on maximizing mutual information are receiving more attention. These two methods are related to each other, and they are also related to a statistical method called canonical correlation analysis, which can be seen as a linear special case of maximum mutual information. Canonical correlation analysis is also related to principal component analysis, partial least squares and multivariate linear regression.
These four analysis methods can be seen as different choices of linear models based on different optimization criteria. Canonical correlation turns out to be a useful tool in several computer vision problems as a new way of constructing and combining filters. Some examples of this are presented in this thesis. We believe that this approach provides a basis for new efficient methods in multidimensional signal processing in general and in computer vision in particular.

1.1 Contributions

The main contributions in this thesis are presented in chapters 3, 4, 6 and 7. Chapters 2 and 5 should be seen as introductions to learning systems and computer vision respectively. The most important individual contributions are:

- A unified framework for principal component analysis (PCA), partial least squares (PLS), canonical correlation analysis (CCA) and multivariate linear regression (MLR) (chapter 4).

- An iterative gradient search algorithm that successively finds the eigenvalues and the corresponding eigenvectors of the generalized eigenproblem. The algorithm can be used for the special cases PCA, PLS, CCA and MLR (chapter 4).

- A method for using canonical correlation for learning feature detectors in high-dimensional signals (chapter 6). With this method, the system can also learn how to combine estimates in a way that is less sensitive to noise than the previously used vector averaging method.

- A stereo algorithm based on canonical correlation and phase analysis that can find correlation between differently scaled images. The algorithm can handle depth discontinuities and estimate multiple depths in semi-transparent images (chapter 7).

The TD-algorithm presented in section 3.6.2 was presented at ICANN'93 in Amsterdam (Borga, 1993). Most of the contents of chapter 4 have been submitted for publication in Information Sciences (Borga et al., 1997b, revised for second review).
The canonical correlation algorithm in section 4.7.3 and most of the contents of chapter 6 were presented at SCIA'97 in Lappeenranta, Finland (Borga et al., 1997a). Finally, the stereo algorithm in chapter 7 has been submitted to ICIPS'98 (Borga and Knutsson, 1998). Large parts of chapter 2 (except the section on unsupervised learning, section 2.5), most of chapter 3 and some of the theory of canonical correlation in chapter 4 were presented in "Reinforcement Learning Using Local Adaptive Models" (Borga, 1995, licentiate thesis).

1.2 Outline

The thesis is divided into two parts. Part I deals with learning theory. Part II describes how the theory discussed in part I can be applied in computer vision.

In chapter 2, learning systems are discussed. The chapter can be seen as an introduction to and overview of this subject. Three important principles of learning are described: reinforcement learning, unsupervised learning and supervised learning. In chapter 3, issues concerning information representation are treated. Linear models and, in particular, local linear models are discussed, and two examples are presented that use linear models for reinforcement learning.

Four low-dimensional linear models are discussed in chapter 4. They are low-rank versions of principal component analysis, partial least squares, canonical correlation and multivariate linear regression. All four methods are related to the generalized eigenproblem, and the solutions can be found by maximizing a Rayleigh quotient. An iterative algorithm for solving the generalized eigenproblem in general, and these four methods in particular, is presented.

Chapter 5 is a short introduction to computer vision. It treats the concepts in computer vision relevant for the remaining chapters. Chapter 6 shows how canonical correlation can be used for learning models that represent local features in images.
Experiments show how this method can be used for finding filter combinations that decrease the noise sensitivity compared to vector averaging while maintaining spatial resolution. In chapter 7, a novel stereo algorithm based on the method from chapter 6 is presented. Canonical correlation analysis is used to adapt filters in a local image neighbourhood. The adapted filters are then analysed with respect to phase to get the disparity estimate. The algorithm can handle differently scaled image pairs and depth discontinuities. It can also estimate multiple depths in semi-transparent images. Chapter 8 is a summary of the thesis and also contains some thoughts on future research.

Finally, there are two appendices. Appendix A contains definitions. Most of the proofs have been placed in appendix B. In this way, the text is hopefully easier to follow for the reader who does not want to get too deep into mathematical details. This also makes it possible to give the proofs enough space to be followed without too much effort, and to include proofs that initiated readers may consider unnecessary, without disrupting the text.

1.3 Notation

Lowercase letters in italics (x) are used for scalars, lowercase letters in boldface (x) are used for vectors and uppercase letters in boldface (X) are used for matrices. The transpose of a real-valued vector or matrix is denoted x^T. The conjugate transpose is denoted x^*. The norm ||v|| of a vector v is defined by ||v|| = sqrt(v^* v), and a "hat" (v̂) indicates a vector with unit length, i.e. v̂ = v / ||v||. E[·] denotes the expectation value of a stochastic variable.

Part I
Learning

Chapter 2
Learning systems

Learning systems are a central concept in this dissertation, and in this chapter three different principles of learning are described. Some standard techniques are described and some important issues related to machine learning are discussed. But first, what is learning?
2.1 Learning

According to the Oxford Advanced Learner's Dictionary (Hornby, 1989), learning is to "gain knowledge or skill by study, experience or being taught." Knowledge may be considered as a set of rules determining how to act. Hence, knowledge can be said to define a behaviour, which, according to the same dictionary, is a "way of acting or functioning." Narendra and Thathachar (1974), two learning automata theorists, give the following definition of learning: "Learning is defined as any relatively permanent change in behaviour resulting from past experience, and a learning system is characterized by its ability to improve its behaviour with time, in some sense towards an ultimate goal."

Learning has been a field of study since the end of the nineteenth century. Thorndike (1898) presented a theory in which an association between a stimulus and a response is established, and this association is strengthened or weakened depending on the outcome of the response. This type of learning is called operant conditioning. The theory of classical conditioning (Pavlov, 1955) is concerned with the case when a natural reflex to a certain stimulus becomes a response to a second stimulus that has preceded the original stimulus several times.

In the 1930s, Skinner developed Thorndike's ideas but claimed, as opposed to Thorndike, that learning was more "trial and success" than "trial and error" (Skinner, 1938). These ideas belong to the psychological position called behaviourism. Since the 1950s, rationalism has gained more interest. In this view, intentions and abstract reasoning play an important role in learning. In this thesis, however, the view is more behaviouristic. The aim is not to model biological systems or mental processes; the goal is rather to make a machine that produces the desired results.
As will be seen, the learning principle called reinforcement learning, discussed in section 2.4, has much in common with Thorndike's and Skinner's operant conditioning. Learning theories have been thoroughly described, for example, by Bower and Hilgard (1981).

There are reasons to believe that "learning by doing" is the only way of learning to produce responses or, as stated by Brooks (1986): "These two processes of learning and doing are inevitably intertwined; we learn as we do and we do as well as we have learned." An example of "learning by doing" is illustrated in an experiment (Held and Bossom, 1961; Mikaelian and Held, 1964) where people wearing goggles that rotated or displaced their fields of view were either walking around for an hour or wheeled around the same path in a wheel-chair for the same amount of time. The adaptation to the distortion was then tested. The subjects who had been walking had adapted, while the other subjects had not. A similar situation occurs, for instance, when you are going somewhere by car. If you have driven to a certain destination before, instead of being a passenger, you will probably find your way more easily the next time.

2.2 Machine learning

We are used to seeing humans and animals learn, but how does a machine learn? The answer depends on how knowledge or behaviour is represented in the machine. Let us consider knowledge to be a rule for how to generate responses to certain stimuli. One way of representing knowledge is to have a table with all stimuli and corresponding responses. Learning would then take place if the system, through experience, filled in or changed the responses in the table. Another way of representing knowledge is by using a parameterized model, where the output is obtained as a given function of the input x and a parameter vector w:

y = f(x; w)   (2.1)

Learning would then be to change the model parameters in order to improve the performance. This is the learning method used, for example, in neural networks.
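Two of the representations described above can be sketched in a few lines of code (an illustrative sketch with made-up names and values, not code from the thesis): a table maps each stimulus directly to a stored response, while a parameterized model computes the response as a function y = f(x; w) of the input and a parameter vector w, as in equation 2.1.

```python
import numpy as np

# Knowledge as a table: each stimulus has a stored response.
# Learning would mean filling in or changing entries.
stimulus_response_table = {"hot": "withdraw", "cold": "approach"}

# Knowledge as a parameterized model y = f(x; w), here a linear one.
# Learning would mean changing w through experience.
def f(x, w):
    return w @ x

w = np.array([0.5, -1.0])          # the parameter vector
y = f(np.array([2.0, 1.0]), w)     # response to the stimulus x
```

In the table case the amount of stored knowledge grows with the number of stimuli, while in the model case it is fixed by the number of parameters.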
Another way of representing knowledge is to consider the input space and output space together. Examples of this approach are an algorithm by Munro (1987) and the Q-learning algorithm (Watkins, 1989). Another example is the prediction matrix memory described in section 3.3.1. The combined space of input and output can be called the decision space, since this is the space in which the combinations of input and output (i.e. stimuli and responses) that constitute decisions exist. The decision space could be treated as a table in which suitable decisions are marked. Learning would then be to make or change these markings. Or the knowledge could be represented in the decision space as distributions describing suitable combinations of stimuli and responses (Landelius, 1993, 1997):

p(y, x; w)   (2.2)

where, again, y is the response, x is the input signal and w contains the parameters of a given distribution function. Learning would then be to change the parameters of these distributions through experience in order to improve some measure of performance. Responses can then be generated from the conditional probability function

p(y | x; w)   (2.3)

The issue of representing knowledge is further discussed in chapter 3.

Obviously a machine can learn through experience by changing some parameters in a model or data in a table. But what is the experience, and what measure of performance is the system trying to improve? In other words, what is the system learning? The answers to these questions depend on what kind of learning we are talking about. Machine learning can be divided into three classes that differ in the external feedback to the system during learning:

- Supervised learning
- Reinforcement learning
- Unsupervised learning

The three different principles are illustrated in figure 2.1. In the following three sections, these three principles of learning are discussed in more detail.
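The distribution-based representation in equations 2.2 and 2.3 can be sketched as follows, under the illustrative assumption that p(y, x; w) is jointly Gaussian with w = (mean, covariance); the class name and interface are hypothetical, not from the thesis. Responses are drawn from the conditional p(y | x; w), computed with the standard Gaussian conditioning formulas.

```python
import numpy as np

# Sketch: knowledge as a joint Gaussian distribution p(y, x; w) over the
# decision space, where w consists of the mean and covariance.  The first
# dim_y coordinates of the decision space are the response y, the rest
# are the stimulus x.
class DecisionSpaceModel:
    def __init__(self, mean, cov, dim_y):
        self.mean, self.cov, self.k = mean, cov, dim_y

    def respond(self, x, rng):
        k = self.k
        m_y, m_x = self.mean[:k], self.mean[k:]
        S_yy = self.cov[:k, :k]
        S_yx = self.cov[:k, k:]
        S_xx = self.cov[k:, k:]
        # standard formulas for the conditional Gaussian p(y | x)
        cond_mean = m_y + S_yx @ np.linalg.solve(S_xx, x - m_x)
        cond_cov = S_yy - S_yx @ np.linalg.solve(S_xx, S_yx.T)
        return rng.multivariate_normal(cond_mean, cond_cov)
```

Learning would then amount to updating the mean and covariance from observed stimulus/response pairs.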
In section 2.6, the relations between the three methods are discussed, and it is shown that the differences are not as great as they may seem at first.

2.3 Supervised learning

In supervised learning there is a teacher who shows the system the desired responses for a representative set of stimuli (see figure 2.1). Here, the experience is pairs of stimuli and desired responses, and improving performance means minimizing some error measure, for example the mean squared distance between the system's output and the desired output.

[Figure 2.1: The three different principles of learning: Supervised learning (a), Reinforcement learning (b) and Unsupervised learning (c).]

Supervised learning can be described as function approximation. The teacher delivers samples of the function, and the algorithm tries, by adjusting the parameters w in equation 2.1 or equation 2.2, to minimize some cost function

E = E[ε]   (2.4)

where E[ε] stands for the expectation of the cost ε over the distribution of data. The instantaneous cost ε depends on the difference between the output of the algorithm and the samples of the function. In this sense, regression techniques can be seen as supervised learning. In general, the cost function also includes a regularization term. The regularization term prevents the system from what is called over-fitting. This is important for the generalization capabilities of the system, i.e. the performance of the system on new data not used for training. In effect, the regularization term can be compared to the polynomial degree in polynomial regression.

2.3.1 Gradient search

Most supervised learning algorithms are based on gradient search on the cost function. Gradient search means that the parameters w_i are changed a small step in the opposite direction of the gradient of the cost function E for each iteration of the process, i.e.
w_i(t + 1) = w_i(t) − α ∂E/∂w_i   (2.5)

where the update factor α is used to control the step length. In general, the negative gradient does of course not point exactly towards the minimum of the cost function. Hence, a gradient search will in general not find the shortest way to the optimum. There are several methods to improve the search by using the second-order partial derivatives (Battiti, 1992). Two well-known methods are Newton's method (see for example Luenberger, 1969) and the conjugate-gradient method (Fletcher and Reeves, 1964). Newton's method is optimal for quadratic cost functions in the sense that, given the Hessian (i.e. the matrix of second-order partial derivatives), it can find the optimum in one step. The problem is the need for calculation and storage of the Hessian and its inverse. The calculation of the inverse requires the Hessian to be non-singular, which is not always the case. Furthermore, the size of the Hessian grows quadratically with the number of parameters. The conjugate-gradient method is also a second-order technique but avoids explicit calculation of the second-order partial derivatives. For an n-dimensional quadratic cost function it reaches the optimum in n steps, but here each step includes a line search, which increases the computational complexity of each step.

A line search can of course also be performed in first-order gradient search. Such a method is called steepest descent. In steepest descent, however, the profit from the line search is not as big. The reason for this is that two successive steps in steepest descent are always perpendicular and, hence, the parameter vector will in general move in a zigzag path.

In practice, the true gradient of the cost function is, in most cases, not known, since the expected cost E is unknown.
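The contrast between plain gradient steps (equation 2.5) and Newton's one-step property can be illustrated on a small quadratic cost (a toy example with made-up numbers, not from the thesis): many small steps along the negative gradient versus a single Newton step using the Hessian.

```python
import numpy as np

# Quadratic cost E(w) = 0.5 w^T H w - b^T w with known Hessian H.
H = np.array([[4.0, 1.0], [1.0, 2.0]])   # Hessian (second-order derivatives)
b = np.array([1.0, 1.0])
w_opt = np.linalg.solve(H, b)            # true minimum of the cost

def gradient(w):
    return H @ w - b

# First-order gradient search (eq. 2.5): many small steps along -gradient.
w = np.zeros(2)
alpha = 0.2                              # fixed update factor
for _ in range(100):
    w = w - alpha * gradient(w)

# Newton's method: given the Hessian, one step reaches the optimum.
w_newton = np.zeros(2) - np.linalg.solve(H, gradient(np.zeros(2)))
```

The conjugate-gradient method would reach the same optimum in n = 2 steps here, without ever forming H explicitly.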
In such cases, an instantaneous sample ε(t) of the cost function can be used, and the parameters are changed according to

w_i(t + 1) = w_i(t) − α ∂ε(t)/∂w_i(t)   (2.6)

This method is called stochastic gradient search, since the gradient estimate varies with the (stochastic) data and the estimate improves on average with an increasing number of samples (see for example Haykin, 1994).

2.3.2 Adaptability

The use of instantaneous estimates of the cost function is not necessarily a disadvantage. On the contrary, it allows for system adaptability. Instantaneous estimates permit the system to handle non-stationary processes, i.e. cases where the cost function changes over time. The choice of the update factor α is crucial for the performance of stochastic gradient search. If the factor is too large, the algorithm will start oscillating and never converge; if the factor is too small, the convergence time will be far too long. In the literature, the factor is often a decaying function of time. The intuitive reason for this is that the more samples the algorithm has used, the closer the parameter vector should be to the optimum and the smaller the steps should be. But, in most cases, the real reason for using a time-decaying update factor is probably that it makes it easier to prove convergence.

In practice, however, choosing α as a function of time only is not a very good idea. One reason is that the optimal rate of decay depends on the problem, i.e. the shape of the cost function, and is therefore impossible to determine beforehand. Another important reason is adaptability. A system with an update factor that decays as a function of time only cannot adapt to new situations. Once the parameters have converged, the system is fixed.
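Stochastic gradient search with a constant update factor can be sketched on a toy problem (an illustrative example, not from the thesis): minimizing E[(w − x)²] over noisy samples x, whose minimum is the mean of x. Each update uses only the gradient of the instantaneous cost ε(t), as in equation 2.6.

```python
import random

# Stochastic gradient search (eq. 2.6) on the instantaneous cost
# epsilon(t) = (w - x(t))^2.  Each gradient estimate is noisy, but the
# parameter improves on average as more samples are seen.
random.seed(0)
w = 0.0
alpha = 0.05                            # constant update factor
for t in range(5000):
    x = 3.0 + random.gauss(0.0, 1.0)    # stimulus with true mean 3
    grad_sample = 2.0 * (w - x)         # d(epsilon)/dw for this sample
    w -= alpha * grad_sample
```

With the constant α, the estimate keeps fluctuating around the optimum, which is precisely what gives the adaptability discussed above: if the mean of x were to drift, w would follow it.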
In general, a better solution is to use an adaptive update factor that enables the parameters to change in large steps when consistently moving towards the optimum and to decrease the steps when the parameter vector is oscillating around the optimum. One example of such a method is the Delta-Bar-Delta rule (Jacobs, 1988). This algorithm has a separate adaptive update factor α_i for each parameter. Another fundamental reason for adaptive update factors, not often mentioned in the literature, is that the step length in equation 2.6 is proportional to the norm of the gradient. It is, however, only the direction of the gradient that is relevant, not the norm. Consider, for example, finding the maximum of a Gaussian by moving proportionally to its gradient. Except for a region around the optimum, the step length gets smaller the further we get from the optimum. A method that deals with this problem is the RPROP algorithm (Riedmiller and Braun, 1993), which adapts the actual step lengths of the parameters and not just the factors α_i.

2.4 Reinforcement learning

In reinforcement learning there is a teacher too, but this teacher does not give the desired responses. Only a scalar reward or punishment (reinforcement signal), graded according to the quality of the system's overall performance, is fed back to the system, as illustrated in figure 2.1 on page 10. In this case, each experience is a triplet of stimulus, response and corresponding reinforcement. The performance to improve is simply the received reinforcement. What is meant by received reinforcement depends on whether or not the system acts in a closed loop, i.e. whether the input to the system or the system state depends on previous output. If there is a closed loop, an accumulated reward over time is probably more important than each instant reward. If there is no closed loop, there is no conflict between maximizing instantaneous reward and accumulated rewards.
The feedback to a reinforcement learning system is evaluative rather than instructive, as it is in supervised learning. The reinforcement signal is in most cases easier to obtain than a set of correct responses. Consider, for example, the situation when a child learns to ride a bicycle. It is not possible for the parents to explain to the child how it should behave, but it is quite easy to observe the trials and conclude how well the child manages. There is also a clear (though negative) reinforcement signal when the child fails. The simple feedback is perhaps the main reason for the great interest in reinforcement learning in the fields of autonomous systems and robotics. The teacher does not have to know how the system should solve a task, but only has to be able to decide if (and perhaps how well) it solves it. Hence, a reinforcement learning system requires feedback to be able to learn, but it is a very simple form of feedback compared to what is required for a supervised learning system. In some cases, the teacher's task may even become so simple that it can be built into the system. For example, consider a system that is only to learn to avoid heat. Here, the teacher may consist only of a set of heat sensors. In such a case, the reinforcement learning system is more like an unsupervised learning system than a supervised one. For this reason, reinforcement learning is often referred to as a class of learning systems that lies between supervised and unsupervised learning systems. A reinforcement, or reinforcing stimulus, is defined as a stimulus that strengthens the behaviour that produced it. As an example, consider the procedure of training an animal. In general, there is no point in trying to explain to the animal how it should behave. The only way is simply to reward the animal when it does the right thing.
If an animal is given a piece of food each time it presses a button when a light is flashed, it will (in most cases) learn to press the button when the light signal appears. We say that the animal's behaviour has been reinforced; we use the food as a reward to train the animal. One could, in this case, say that it is the food itself that reinforces the behaviour. In general, there is some mechanism in the animal that generates an internal reinforcement signal when the animal gets food (at least if it is hungry) and when it experiences other things that are good for it, i.e. that increase the probability of the reproduction of its genes. A biochemical process involving dopamine is believed to play a central role in the distribution of the reward signal (Bloom and Lazerson, 1985; Schultz et al., 1997). In the 1950s, experiments were made (Olds and Milner, 1954) in which the internal reward system was artificially stimulated instead of an external reward being given. In this case, the animal was even able to learn self-destructive behaviour. In the example above, the reward (the piece of food) was used merely to trigger the reinforcement signal. In the following discussion of artificial systems, however, the two terms have the same meaning. In other words, we will use only one kind of reward, namely the reinforcement signal itself, to which we, in the case of an artificial system, can allow ourselves direct access without any ethical considerations. In the case of a large system, one would of course want the system to be able to solve different routine tasks besides the main task (or tasks). For instance, suppose we want the system to learn to charge its batteries. Such a behaviour should then be reinforced in some way. Whether we put a box into the system that reinforces the battery-charging behaviour, or let the charging device or a teacher deliver the reinforcement signal, is a technical question rather than a philosophical one.
If, however, the box is built into the system, we can reinforce behaviour simply by charging the system's batteries. Reinforcement learning is strongly associated with learning among animals (including humans), and some people find it hard to see how a machine could learn by a "trial-and-error" method. To show that machines can indeed learn in this way, a simple example was created by Donald Michie in the 1960s: a pile of match-boxes that learns to play noughts and crosses, illustrating that even a very simple machine can learn by trial and error. The machine is called MENACE (Match-box Educable Noughts And Crosses Engine) and consists of 288 match-boxes, one for each possible state of the game. Each box is filled with a random set of coloured beans, where the colours represent different moves. Each move is determined by the colour of a randomly selected bean from the box representing the current state of the game. If the system wins the game, new beans with the same colours as those selected during the game are added to the respective boxes. If the system loses, the beans that were selected are removed. In this way, after each game, the probability of making good moves increases and the risk of making bad moves decreases. Ultimately, each box will only contain beans representing moves that have led to success. There are some notable advantages of reinforcement learning compared to supervised learning, besides the obvious fact that reinforcement learning can be used in some situations where supervised learning is impossible (e.g. the child learning to ride a bicycle and the animal training examples above). The ability to learn by receiving rewards makes it possible for a reinforcement learning system to become more skilful than its teacher. It can even improve its behaviour by training itself, as in the backgammon program by Tesauro (1990).

2.4.1 Searching for higher rewards

In reinforcement learning, the feedback to the system contains no gradient information, i.e.
the system does not know in which direction to search for a better solution. For this reason, most reinforcement learning systems are designed to have a stochastic behaviour. A stochastic behaviour can be obtained by adding noise to the output of a deterministic input-output function or by generating the output from a probability distribution. In both cases, the output can be seen as consisting of two parts: one deterministic and one stochastic. It is easy to see that both these parts are necessary for the system to be able to improve its behaviour. The deterministic part is the optimum response given the current knowledge. Without the deterministic part, the system would make no sensible decisions at all. However, if the deterministic part were the only one, the system would easily get trapped in a non-optimal behaviour. As soon as the received rewards are consistent with current knowledge, the system will be satisfied and never change its behaviour. Such a system will only maximize the reward predicted by the internal model, not the external reward actually received. The stochastic part of the response provides the system with information from points in the decision space that would never be sampled otherwise. So, the deterministic part of the output is necessary for generating good responses with respect to the current knowledge, and the stochastic part is necessary for gaining more knowledge. The stochastic behaviour can also help the system avoid getting trapped in local maxima. The conflict between the need for exploration and the need for precision is typical of reinforcement learning. The conflict is usually referred to as the exploration-exploitation dilemma. This dilemma does not normally occur in supervised learning.
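The MENACE machine described earlier is a concrete instance of these two parts: the bean counts in each box encode the learned (deterministic) preferences, while the random draw of a bean supplies the exploration. A minimal sketch, assuming a hypothetical one-state game rather than Michie's full 288-box noughts-and-crosses machine:

```python
import random

class MatchboxLearner:
    """MENACE-style learner: one 'box' of coloured beans per game state.

    A move is drawn at random from the box for the current state; after a
    win, beans of the chosen colours are added, after a loss removed.
    """

    def __init__(self, moves, initial_beans=3):
        self.moves = moves
        self.initial = initial_beans
        self.boxes = {}    # state -> list of beans (move labels)

    def choose(self, state):
        box = self.boxes.setdefault(
            state, [m for m in self.moves for _ in range(self.initial)])
        return random.choice(box)          # stochastic part: random draw

    def learn(self, history, won):
        for state, move in history:
            box = self.boxes[state]
            if won:
                box.append(move)           # reinforce the selected move
            elif box.count(move) > 1:
                box.remove(move)           # punish, but keep >= 1 bean

# a hypothetical one-state game: move 'a' always wins, 'b' always loses
random.seed(1)
learner = MatchboxLearner(moves=['a', 'b'])
for _ in range(200):
    move = learner.choose('start')
    learner.learn([('start', move)], won=(move == 'a'))
wins = sum(learner.choose('start') == 'a' for _ in range(100))
```

After training, the box is dominated by beans for the winning move, so the drawn move is almost always 'a' while a small residual probability of drawing 'b' remains, mirroring the deterministic/stochastic split discussed above.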
At the beginning, when the system has poor knowledge of the problem to be solved, the deterministic part of the response is very unreliable, and the stochastic part should preferably dominate in order to avoid a misleading bias in the search for correct responses. Later on, however, when the system has gained more knowledge, the deterministic part should have more influence so that the system makes at least reasonable guesses. Eventually, when the system has gained a lot of experience, the stochastic part should be very small in order not to disturb the generation of correct responses. A constant relation between the influence of the deterministic and stochastic parts is a compromise which will give a poor search behaviour (i.e. slow convergence) at the beginning and bad precision after convergence. Therefore, many reinforcement learning systems have noise levels that decay with time. There is, however, a problem with such an approach too. The decay rate of the noise level must be chosen to fit the problem. A difficult problem takes longer to solve, and if the noise level is decreased too fast, the system may never reach an optimal solution. Conversely, if the noise level decreases too slowly, the convergence will be slower than necessary. Another problem arises in a dynamic environment where the task may change after some time. If the noise level at that time is too low, the system will not be able to adapt to the new situation. For these reasons, an adaptive noise level is preferable. The basic idea of an adaptive noise level is that when the system has poor knowledge of the problem, the noise level should be high, and when the system has reached a good solution, the noise level should be low. This requires an internal quality measure that indicates the average performance of the system. It could of course be accomplished by accumulating the rewards delivered to the system, for instance by an iterative method, i.e.
p(t + 1) = αp(t) + (1 − α)r(t), (2.7) where p is the performance measure, r is the reward and α is the update factor, 0 < α < 1. Equation 2.7 gives an exponentially decaying average of the rewards given to the system, where the most recent rewards will be the most significant ones. A solution, involving a variance that depends on the predicted reinforcement, has been suggested by Gullapalli (1990). The advantage with such an approach is that the system might expect different rewards in different situations for the simple reason that the system may have learned some situations better than others. The system should then have a very deterministic behaviour in situations where it predicts high rewards and a more exploratory behaviour in situations where it is more uncertain. Such a system will have a noise level that depends on the local skill rather than the average performance. Another way of controlling the noise level, or rather the standard deviation σ of a stochastic output unit, is found in the REINFORCE algorithm (Williams, 1988). Let µ be the mean of the output distribution and y the actual output. When the output y gives a higher reward than the recent average, the variance will decrease if |y − µ| < σ and increase if |y − µ| > σ. When the reward is less than average, the opposite changes are made. This leads to a more narrow search behaviour if good solutions are found close to the current solution or bad solutions are found outside the standard deviation, and a wider search behaviour if good solutions are found far away or bad solutions are found close to the mean. Another strategy for a reinforcement learning system to improve its behaviour is to differentiate a model of the reward with respect to the system parameters in order to estimate the gradient of the reward in the system's parameter space. The model can be known a priori and built into the system, or it can be learned and refined during the training of the system.
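Equation 2.7 is a one-line exponential moving average; a minimal sketch (the constant reward stream used to exercise it is just an illustrative test case):

```python
def update_performance(p, r, alpha=0.9):
    """Exponentially decaying average of rewards (equation 2.7).

    With 0 < alpha < 1, recent rewards dominate: each old reward's
    contribution shrinks by a factor alpha per time step.
    """
    return alpha * p + (1.0 - alpha) * r

# feeding a constant reward drives the performance measure towards it
p = 0.0
for _ in range(200):
    p = update_performance(p, 1.0)
```

Such a running estimate p could then drive an adaptive noise level, e.g. high noise while p is low and low noise once p approaches the maximum reward, as discussed in the text.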
To know the gradient of the reward means to know in which direction in the parameter space to search for a better performance. One way to use this strategy is described by Munro (1987), where the model is a secondary network that is trained to predict the reward. This can be done with back-propagation, using the difference between the reward and the prediction as an error measure. Then back-propagation can be used to modify the weights in the primary network, but here with the aim of maximizing the prediction made by the secondary network. A similar approach was used to train a pole-balancing system (Barto et al., 1983). Other examples of similar strategies are described by Williams (1988).

Adaptive critics

When the learning system operates in a dynamic environment, the system may have to carry out a sequence of actions to get a reward. In other words, the feedback to such a system may be infrequent and delayed, and the system faces what is known as the temporal credit assignment problem (see section 2.7.2 on page 35). Assume that the environment or process to be controlled is a Markov process. A Markov process consists of a set S of states s_i where the conditional probability of a state transition depends only on a finite number of previous states. The definition of the states can be reformulated so that the state transition probabilities depend only on the current state, i.e. P(s_{k+1} | s_k, s_{k−1}, …, s_1) = P(s′_{k+1} | s′_k), (2.8) which is a first-order Markov process. Derin and Kelly (1989) present a systematic classification of different types of Markov models. Suppose one or several of the states in a Markov process are associated with a reward. Now, the goal for the learning system can be defined as maximizing the total accumulated reward for all future time steps.
One way to accomplish this task for a discrete Markov process is, as in the MENACE example above, to store all states and actions until the final state is reached and to update the state transition probabilities afterwards. This method is referred to as batch learning. An obvious disadvantage of batch learning is the need for storage, which becomes infeasible for large dimensionalities of the input and output vectors as well as for long sequences. A problem that occurs when only the final outcome is considered is illustrated in figure 2.2. Consider a game where a certain position has resulted in a loss in 90% of the cases and a win in 10% of the cases. This position is classified as a bad position. Now, suppose that a player reaches a novel state (i.e. a state that has not been visited before) that inevitably leads to the bad state and finally happens to lead to a win. If the player waits until the end of the game and only looks at the result, he will label the novel state as a good state since it led to a win. This is, however, not true: the novel state is a bad state since it probably leads to a loss. Adaptive critics (Barto, 1992) is a class of methods designed to handle the problem illustrated in figure 2.2. Let us, for simplicity, assume that the input vector x_k uniquely defines the state s_k. (This assumption is of course not always true. When it does not hold, the system faces the perceptual aliasing problem, which is discussed in section 2.7.1 on page 33.) Suppose that for each state x_k there is a value V_g(x_k) that is an estimate of the expected future result (e.g. a weighted sum of the accumulated reinforcement) when following a policy g, i.e. generating the output as y = g(x). [Figure 2.2 shows a diagram in which a "novel" state leads to a "bad" state, which leads to a loss in 90% of cases and a win in 10%. Caption: An example to illustrate the advantage of adaptive critics. A state that is likely to lead to a loss is classified as a bad state.] In adaptive critics, the value V_g(x_k) depends on the value
[Figure 2.2 caption, continued: A novel state that leads to the bad state but then happens to lead to a win is classified as a good state if only the final outcome is considered. In adaptive critics, the novel state is recognized as a bad state, since it most likely leads to a loss.] V_g(x_{k+1}) and not only on the final result: V_g(x_k) = r(x_k, g(x_k)) + γV_g(x_{k+1}), (2.9) where r(x_k, g(x_k)) is the reward for being in the state x_k and generating the response y_k = g(x_k). This means that V_g(x_k) = Σ_{i=k}^{N} γ^{i−k} r(x_i, g(x_i)), (2.10) i.e. the value of a state is a weighted sum of all future rewards. The weight γ ∈ [0, 1] can be used to make rewards that are close in time more valuable than rewards further away. Equation 2.9 makes it possible for adaptive critics to improve their predictions during a process without always having to wait for the final result. Suppose that the environment can be described by a function f, so that x_{k+1} = f(x_k, y_k). Now equation 2.9 can be written as V_g(x_k) = r(x_k, g(x_k)) + γV_g(f(x_k, g(x_k))). (2.11) The optimal response y* is the response given by the optimal policy g*: y* = g*(x) = arg max_y {r(x, y) + γV*(f(x, y))}, (2.12) where V* is the value of the optimal policy (Bellman, 1957). In the methods of temporal differences (TD) described by Sutton (1988), the value function V is estimated using the difference between the values of two consecutive states as an internal reward signal. Another well-known method for adaptive critics is Q-learning (Watkins, 1989). In Q-learning, the system tries to estimate the Q-function Q_g(x, y) = r(x, y) + γV_g(f(x, y)) (2.13) rather than the value function V itself. Using the Q-function, the optimal response is y* = g*(x) = arg max_y {Q*(x, y)}. (2.14) This means that a model of the environment f is not required in Q-learning in order to find the optimal response.
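The value recursion of equation 2.13 and the greedy response of equation 2.14 can be sketched as tabular Q-learning on a toy problem. The four-state chain, the environment function f, the reward r and all parameter values below are illustrative assumptions, not from the thesis:

```python
import random

def q_learning(f, r, states, actions, terminal, gamma=0.9, alpha=0.5,
               episodes=500, steps=20, eps=0.2):
    """Tabular Q-learning sketch for a deterministic environment.

    Q(x, y) estimates r(x, y) + gamma * V(f(x, y)) (equation 2.13); the
    greedy response is argmax_y Q(x, y) (equation 2.14), so no model of
    f is needed at decision time.
    """
    Q = {(x, y): 0.0 for x in states for y in actions}
    for _ in range(episodes):
        x = random.choice(states)
        for _ in range(steps):
            if x == terminal:
                break
            # epsilon-greedy: mostly deterministic, sometimes exploratory
            y = (random.choice(actions) if random.random() < eps
                 else max(actions, key=lambda a: Q[(x, a)]))
            x2 = f(x, y)
            if x2 == terminal:
                target = r(x, y)
            else:
                target = r(x, y) + gamma * max(Q[(x2, a)] for a in actions)
            Q[(x, y)] += alpha * (target - Q[(x, y)])
            x = x2
    return Q

# a four-state chain 0-1-2-3: moving right from state 2 reaches goal 3
random.seed(0)
f = lambda x, y: min(x + 1, 3) if y == 'right' else max(x - 1, 0)
r = lambda x, y: 1.0 if (x == 2 and y == 'right') else 0.0
Q = q_learning(f, r, states=[0, 1, 2], actions=['left', 'right'], terminal=3)
policy = {x: max(['left', 'right'], key=lambda a: Q[(x, a)])
          for x in [0, 1, 2]}
```

Note how the reward at the goal propagates backwards through the chain via the bootstrapped target, so early states acquire correct values without waiting for the final outcome, which is exactly the point of adaptive critics.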
In control theory, an optimization algorithm called dynamic programming is a well-known method for maximizing the expected total accumulated reward. The relationship between TD-methods and dynamic programming has been discussed, for example, by Barto (1992), Werbos (1990) and Whitehead et al. (1990). It should be noted, however, that maximizing the expected accumulated reward is not always the best criterion, as discussed by Heger (1994). He notes that this criterion of choice of action is based upon long-run considerations, where the decision process is repeated a sufficiently large number of times. It is not necessarily a valid criterion in the short-run or one-shot case, especially when the possible consequences or their probabilities have extreme values. The criterion also assumes the subjective values of possible outcomes to be proportional to their objective values, which is not necessarily the case, especially when the values involved are large. As an illustrative example, many people occasionally play on lotteries in spite of the fact that the expected outcome is negative. Another example is that most people do not invest all their money in stocks, although such a strategy would give a larger expected payoff than putting some of it in the bank. The first well-known use of adaptive critics was in a checkers-playing program (Samuel, 1959). In that system, the value of a state (board position) was updated according to the values of future states likely to appear. The prediction of future states requires a model of the environment (the game). This is, however, not the case in TD-methods like the adaptive heuristic critic algorithm (Sutton, 1984), where the feedback comes from actual future states and, hence, prediction is not necessary. Sutton (1988) has proved a convergence theorem for one TD-method which states that the prediction for each state asymptotically converges to the maximum-likelihood prediction of the final outcome for states generated in a Markov process.
Other proofs concerning adaptive critics in finite-state systems have been presented, for example, by Watkins (1989), Jaakkola et al. (1994) and Baird (1995). Proofs for continuous state spaces have been presented by Werbos (1990), Bradtke (1993) and Landelius (1997). Other methods for handling delayed rewards are, for example, heuristic dynamic programming (Werbos, 1990) and back-propagation of utility (Werbos, 1992). Recent physiological findings indicate that the output of dopaminergic neurons signals errors in the predicted reward function, i.e. the internal reward used in TD-learning (Schultz et al., 1997).

2.4.2 Generating the reinforcement signal

Werbos (1990) defines a reinforcement learning system as "any system that through interaction with its environment improves its performance by receiving feedback in the form of a scalar reward (or penalty) that is commensurate with the appropriateness of the response." The goal for a reinforcement learning system is simply to maximize the reward, for example the accumulated value of the reinforcement signal r. Hence, r can be said to define the problem to be solved, and therefore the choice of reward function is very important. The reward, or reinforcement, must be capable of evaluating the overall performance of the system and be informative enough to allow learning. In some cases, how to choose the reinforcement signal is obvious. For example, in the pole-balancing problem (Barto et al., 1983), the reinforcement signal is chosen as a negative value upon failure and as zero otherwise. Often, however, how to measure the performance is not evident, and the choice of reinforcement signal will affect the learning capabilities of the system. The reinforcement signal should contain as much information as possible about the problem. The learning performance of a system can be improved considerably if a pedagogical reinforcement is used.
(In the TD-method for which Sutton proved convergence, called TD(0), the value V_k depends only on the following value V_{k+1} and not on later predictions. Other TD-methods can take later predictions into account, weighted by a function that decreases exponentially with time.) One should not sit and wait for the system to attain a perfect performance, but use the reward to guide the system to a better performance. This is obvious in the case of training animals and humans, but it also applies to the case of training artificial systems with reinforcement learning. Consider, for instance, an example where a system is to learn a simple function y = f(x). If a binary reward is used, i.e. r = 1 if |ỹ − y| < ε and r = 0 otherwise, (2.15) where ỹ is the output of the system and y is the correct response, the system will receive no information at all (or almost none: as the number of possible solutions which give output outside the interval approaches infinity, which it does in a continuous system, the information approaches zero) as long as the responses are outside the interval defined by ε. If, on the other hand, the reward is chosen inversely proportional to the error, i.e. r = 1 / |ỹ − y|, (2.16) a relative improvement will yield the same relative increase in reward for all outputs. In practice, of course, the reward function in equation 2.16 could cause numerical problems, but it serves as an illustrative example of a well-shaped reward function. In general, a smooth and continuous function is preferable. Also, the derivative should not be too small, at least not in regions where the system should not get stuck, i.e. in regions of bad performance. It should be noted, however, that sometimes there is no obvious way of defining a continuous reward function. In the case of pole balancing (Barto et al., 1983), for example, the pole either falls or it does not. A perhaps more interesting example where a pedagogical reward is used can be found in a paper by Gullapalli (1990), which presents a "reinforcement learning system for learning real-valued functions". This system was supplied with two input variables and one output variable. In one case, the system was trained on an XOR-task.
Each input was 0.1 or 0.9, and the output was any real number between 0 and 1. The optimal output values were 0.1 and 0.9 according to the logical XOR-rule. At first, the reinforcement signal was calculated as r = 1 − |ε|, (2.17) where ε is the difference between the output and the optimal output. The system sometimes converged to wrong results, and in several training runs it did not converge at all. A new reinforcement signal was then calculated as r′ = (r + r_task) / 2. (2.18) The term r_task was set to 0.5 if the latest output for similar inputs was less than the latest output for dissimilar inputs, and to −0.5 otherwise. With the reinforcement signal in equation 2.18, the system began by trying to satisfy a weaker definition of the XOR-task, according to which the output should be higher for dissimilar inputs than for similar inputs. The learning performance of the system improved in several ways with the new reinforcement signal. Another reward strategy is to reward only improvements in behaviour, for example by calculating the reinforcement as r = p − r̄, (2.19) where p is a performance measure and r̄ is the mean reward acquired by the system. Equation 2.19 gives a system that is never satisfied, since the reward vanishes in any solution with a stable reward. If the system has an adaptive search behaviour as described in the previous section, it will keep on searching for better and better solutions. The advantage of such a reward is that the system will not get stuck in a local optimum. The disadvantage is, of course, that it will not stay in the global optimum either, if such an optimum exists.
It will, however, always return to the global optimum, and this behaviour can be useful in a dynamic environment where a new optimum may appear after some time. Even if the reward in the previous equation is a bit odd, it points out the fact that there may be negative reward, or punishment. The pole-balancing system (Barto et al., 1983) is an example of the use of negative reinforcement, and in this case it is obvious that it is easier to deliver punishment upon failure than reward upon success, since the reward would be delivered after an unpredictably long sequence of actions; it would take an infinite amount of time to verify a success! In general, however, it is probably better to use positive reinforcement to guide a system towards a solution, for the simple reason that there is usually more information in the statement "this was a good solution" than in the opposite statement "this was not a good solution". On the other hand, if the purpose is to make the system avoid a particular solution (i.e. "Do anything but this!"), punishment would probably be more efficient.

2.4.3 Learning in an evolutionary perspective

In this section, a special case of reinforcement learning called genetic algorithms is described. The purpose is not to give a detailed description of genetic algorithms, but to illustrate the fact that they are indeed reinforcement learning algorithms. From this fact, and from the obvious similarity between biological evolution and genetic algorithms (as indicated by the name), some interesting conclusions can be drawn concerning the question of learning at different time scales. A genetic algorithm is a stochastic search method for solving optimization problems. The theory was founded by Holland (1975) and is inspired by the theory of natural evolution. In natural evolution, the problem to be optimized is how to survive in a complex and dynamic environment.
The knowledge of this problem is encoded as genes in the individuals' chromosomes. The individuals that are best adapted in a population have the highest probability of reproduction. In reproduction, the genes of the new individuals (children) are a mixture, or crossover, of the parents' genes. In reproduction there is also a random change in the chromosomes, called mutation. A genetic algorithm works with coded structures of the parameter space in a similar way. It uses a population of coded structures (individuals) and evaluates the performance of each individual. Each individual is reproduced with a probability that depends on that individual's performance. The genes of the new individuals are a mixture of the genes of two parents (crossover), and there is a random change in the coded structure (mutation). Thus, genetic algorithms learn by the method of trial and error, just like other reinforcement learning algorithms. We might therefore argue that the same basic principles hold both for developing a system (or an individual) and for adapting the system to its environment. This is important, since it makes the question of what should be built into the machine from the beginning and what should be learned by the machine more of a practical engineering question than a principled one. This conclusion does not make the question less important, though; in practice, it is perhaps one of the most important issues. Another interesting relation between evolution and learning on the individual level is discussed by Hinton and Nowlan (1987). They show that learning organisms evolve faster than non-learning equivalents. This is maybe not very surprising if evolution and learning are considered as merely different levels of a hierarchical learning system. Then the convergence of the slow high-level learning process (corresponding to evolution) depends on the adaptability of the faster low-level learning process (corresponding to individual learning).
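The selection-crossover-mutation loop described above can be sketched in a few lines. The bit-string encoding, the "one-max" fitness function and all parameter values are illustrative assumptions, not from the thesis:

```python
import random

def genetic_search(fitness, length=16, pop_size=30, generations=60,
                   p_mut=0.02):
    """Minimal genetic algorithm over fixed-length bit strings.

    Parents are drawn with probability proportional to fitness, children
    are a one-point crossover of two parents, and each gene then flips
    with a small mutation probability.
    """
    pop = [[random.randint(0, 1) for _ in range(length)]
           for _ in range(pop_size)]
    for _ in range(generations):
        scores = [fitness(ind) for ind in pop]
        new_pop = []
        for _ in range(pop_size):
            # fitness-proportional selection of two parents
            a, b = random.choices(pop, weights=scores, k=2)
            cut = random.randrange(1, length)           # one-point crossover
            child = a[:cut] + b[cut:]
            child = [1 - g if random.random() < p_mut else g
                     for g in child]                    # mutation
            new_pop.append(child)
        pop = new_pop
    return max(pop, key=fitness)

# "one-max": fitness counts the ones; +1 keeps all selection weights > 0
random.seed(0)
best = genetic_search(lambda ind: sum(ind) + 1)
```

Here the scalar fitness plays exactly the role of the reinforcement signal: the algorithm never sees the "correct" chromosome, only a graded evaluation of each trial.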
This indicates that hierarchical systems adapt faster than non-hierarchical systems of the same complexity. More information about genetic algorithms can be found, for example, in the books by Davis (1987) and Goldberg (1989).

2.5 Unsupervised learning

In unsupervised learning there is no external feedback at all (see figure 2.1 on page 10). The system's experience mentioned on page 9 consists of a set of signals, and the measure of performance is often some statistical or information-theoretical property of the signal. Unsupervised learning is perhaps not learning in the word's everyday sense, since the goal is not to learn to produce responses in the form of useful actions. Rather, it is to learn a certain representation which is thought to be useful in further processing. The importance of a good representation of the signals is discussed in chapter 3. Unsupervised learning systems are often called self-organizing systems (Haykin, 1994; Hertz et al., 1991). Hertz et al. (1991) describe two principles for unsupervised learning: Hebbian learning and competitive learning. Haykin (1994) also uses these two principles but adds a third one based on mutual information, which is an important concept in this thesis. Next, these three principles of unsupervised learning are described.

2.5.1 Hebbian learning

Hebbian learning originates from the pioneering work of the neuropsychologist Hebb (1949). The basic idea is that when one neuron repeatedly causes a second neuron to fire, the connection between them is strengthened. Hebb's idea has later been extended with the formulation that if the two neurons have uncorrelated activities, the connection between them is weakened. In learning and neural network theory, Hebbian learning is usually formulated more mathematically.
Consider a linear unit where the output is calculated as

y = ∑_{i=1}^{N} w_i x_i.  (2.20)

The simplest Hebbian learning rule for such a linear unit is

w_i(t+1) = w_i(t) + α x_i(t) y(t).  (2.21)

Consider the expected change ∆w of the parameter vector w, using y = xᵀw:

E[∆w] = α E[xxᵀ]w = α C_xx w.  (2.22)

Since C_xx is positive semi-definite, any component of w parallel to an eigenvector of C_xx corresponding to a non-zero eigenvalue will grow exponentially, and a component in the direction of an eigenvector corresponding to the largest eigenvalue (in the following called a maximal eigenvector) will grow fastest. Therefore, w will approach a maximal eigenvector of C_xx. If x has zero mean, C_xx is the covariance matrix of x and, hence, a linear unit with Hebbian learning will find the direction of maximum variance in the input data, i.e. the first principal component of the input signal distribution (Oja, 1982). Principal component analysis (PCA) is discussed in section 4.2 on page 64.

A problem with equation 2.21 is that it does not converge. A solution to this problem is Oja’s rule (Oja, 1982):

w_i(t+1) = w_i(t) + α y(t) (x_i(t) − y(t) w_i(t)).  (2.23)

This extension of Hebb’s rule makes the norm of w approach 1, while its direction still approaches that of a maximal eigenvector, i.e. the first principal component of the input signal distribution. Again, if x has zero mean, Oja’s rule finds the one-dimensional representation y of x that has maximum variance under the constraint that ‖w‖ = 1. In order to find more than one principal component, Oja (1989) proposed a modified learning rule for N units:

w_ij(t+1) = w_ij(t) + α y_i(t) ( x_j(t) − ∑_{k=1}^{N} y_k(t) w_kj(t) ),  (2.24)

where w_ij is weight j in unit i. A similar modification for N units was proposed by Sanger (1989); it is identical to equation 2.24 except for the summation, which ends at i instead of N.
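Before turning to the difference between these two rules, Oja's rule (equation 2.23) is easy to verify in a small simulation. The sketch below is an illustrative example, not taken from the thesis; the data distribution, the step size α = 0.005 and the number of samples are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Zero-mean 2-D data whose direction of maximum variance is (1, 1)/sqrt(2).
A = np.diag([2.0, 0.5])
c, s = np.cos(np.pi / 4), np.sin(np.pi / 4)
R = np.array([[c, -s], [s, c]])
X = rng.standard_normal((5000, 2)) @ A @ R.T

w = rng.standard_normal(2)
alpha = 0.005
for x in X:
    y = x @ w
    w += alpha * y * (x - y * w)      # Oja's rule (equation 2.23)

C = X.T @ X / len(X)                  # Cxx (the data has zero mean)
eigvals, eigvecs = np.linalg.eigh(C)
v = eigvecs[:, -1]                    # a maximal eigenvector of Cxx

print(np.linalg.norm(w))              # close to 1
print(abs(w @ v))                     # close to 1 (w parallel to v, up to sign)
```

As the text states, the norm of w approaches 1 and its direction approaches that of a maximal eigenvector of C_xx.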
The difference is that Sanger’s rule finds the N first principal components (sorted in order), whereas Oja’s rule finds N vectors spanning the same subspace as the N first principal components.

A note on correlation and covariance matrices

In the neural network literature, the matrix C_xx in equation 2.22 is often called a correlation matrix. This can be a bit confusing, since C_xx does not contain the correlations between the variables in a statistical sense, but rather the expected values of the products between them. The correlation between x_i and x_j is defined as

ρ_ij = E[(x_i − x̄_i)(x_j − x̄_j)] / √( E[(x_i − x̄_i)²] E[(x_j − x̄_j)²] )  (2.25)

(see for example Anderson, 1984), i.e. the covariance between x_i and x_j normalized by the geometric mean of the variances of x_i and x_j (x̄ = E[x]). Hence, the correlation is bounded, −1 ≤ ρ_ij ≤ 1, and the diagonal terms of a correlation matrix, i.e. a matrix of correlations, are one. The diagonal terms of C_xx in equation 2.22 are the second-order origin moments, E[x_i²], of x_i. The diagonal terms of a covariance matrix are the variances, or second-order central moments, E[(x_i − x̄_i)²], of x_i. The maximum likelihood estimator of ρ is obtained by replacing the expectation operator in equation 2.25 by a sum over the samples (Anderson, 1984). This estimator is sometimes called the Pearson correlation coefficient, after Pearson (1896).

2.5.2 Competitive learning

In competitive learning there are several computational units competing to give the output. For a neural network, this means that among several units in the output layer only one will fire while the rest will be silent. Hence, they are often called winner-take-all units. Which unit fires depends on the input signal. The units specialize to react to certain stimuli and are therefore sometimes called grandmother cells. This term was coined to illustrate the lack of biological plausibility of such highly specialized neurons.
(There is probably not a single neuron in your brain waiting just to detect your grandmother.) Nevertheless, the most well-known implementation of competitive learning, the self-organizing feature map (SOFM) (Kohonen, 1982), is highly motivated by the topologically organized feature representations in the brain. For instance, in the visual cortex, line detectors are organized on a two-dimensional surface so that adjacent detectors are sensitive to similar line orientations (Hubel and Wiesel, 1962). In the simplest case, competitive learning can be described as follows. Each unit receives the same input x, and unit i is the winner if ‖w_i − x‖ < ‖w_j − x‖ for all j ≠ i. A simple learning rule is to update the parameter vector of the winner according to

w_i(t+1) = w_i(t) + α (x(t) − w_i(t)),  (2.26)

i.e. to move the winning parameter vector towards the present input. The rest of the parameter vectors are left unchanged. If the output of the winning unit is one, equation 2.26 can be written as

w_i(t+1) = w_i(t) + α y_i (x(t) − w_i(t))  (2.27)

for all units (since y_i = 0 for all losers). Equation 2.27 is a modification of the Hebb rule in equation 2.21 and is identical to Oja’s rule (equation 2.23) if y_i ∈ {0, 1} (Hertz et al., 1991).

Vector quantization

A rather simple, but important, application of competitive learning is vector quantization (Gray, 1984). The purpose of vector quantization is to quantize a distribution of vectors x into N classes, so that all vectors that fall into one class can be represented by a single prototype vector w_i. The goal is to minimize the distortion between the input vectors x and the prototype vectors. The distortion measure is usually defined using a Euclidean metric:

D = ∫_{R^N} p(x) ‖x − w‖² dx,  (2.28)

where p(x) is the probability density function of x.
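A minimal simulation of competitive learning used for vector quantization (equation 2.26) might look as follows. The three-cluster data, the seeding of one prototype in each cluster (a simplification that sidesteps the "dead unit" problem), and the step size are illustrative assumptions, not part of the original text:

```python
import numpy as np

rng = np.random.default_rng(1)

# Input distribution: three well-separated clusters in the plane.
centres = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
parts = [c + 0.3 * rng.standard_normal((300, 2)) for c in centres]
X = np.concatenate(parts)
rng.shuffle(X)

# One prototype vector seeded in each cluster.
w = np.array([p[0] for p in parts])

alpha = 0.05
for x in X:
    i = np.argmin(np.linalg.norm(w - x, axis=1))   # the winner takes all
    w[i] += alpha * (x - w[i])                     # equation 2.26

# Each prototype has moved close to "its" cluster centre.
for c in centres:
    assert np.linalg.norm(w - c, axis=1).min() < 0.5
```

Each prototype vector ends up near the mean of the vectors it wins, which is exactly the behaviour that keeps the distortion D low.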
Kohonen (1989) has proposed a modification of the competitive learning rule in equation 2.26 for use in classification tasks:

w_i(t+1) = w_i(t) + α (x(t) − w_i(t))  if the classification is correct,
w_i(t+1) = w_i(t) − α (x(t) − w_i(t))  if the classification is incorrect.  (2.29)

The need for feedback from a teacher means that this is a supervised learning rule. It works like the standard competitive learning rule in equation 2.26 if the winning prototype vector represents the desired class, but moves in the opposite direction if it does not. The learning rule is called learning vector quantization (LVQ) and can be used for classification. (Note that several prototype vectors can belong to the same class.)

Feature maps

The self-organizing feature map (SOFM) (Kohonen, 1982) is an unsupervised competitive learning rule, but without winner-take-all units. It is similar to the vector quantization methods just described but has local connections between the prototype vectors. The standard update rule for a SOFM is

w_i(t+1) = w_i(t) + α h(i, j) (x(t) − w_i(t)),  (2.30)

where h(i, j) is a neighbourhood function that depends on the distance between the current unit i and the winning unit j. A common choice of h(i, j) is a Gaussian. Note that the distance is not between the parameter vectors but between the units in the network; hence, a topological ordering of the units is implied. Note also that all units, not only the winner, are updated (although some of them with very small steps). The topologically ordered units and the neighbourhood function cause nearby units to have more similar prototype vectors than units far apart. Hence, if these parameter vectors are seen as feature detectors (i.e. filters), similar features will be represented by nearby units. Equation 2.30 causes the parameter vectors to be more densely distributed in areas where the input probability is high and more sparsely distributed where the input probability is low.
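The SOFM update in equation 2.30 can be sketched for a one-dimensional chain of units mapping a two-dimensional input distribution. All numerical constants, including the slowly shrinking neighbourhood width, are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

n_units = 10
pos = np.arange(n_units)          # unit positions along the (1-D) map
w = rng.random((n_units, 2))      # prototype vectors in the input space
X = rng.random((4000, 2))         # input: uniform on the unit square

alpha, sigma = 0.1, 2.0
for x in X:
    j = np.argmin(np.linalg.norm(w - x, axis=1))      # winner unit
    h = np.exp(-(pos - j) ** 2 / (2 * sigma ** 2))    # Gaussian neighbourhood
    w += alpha * h[:, None] * (x - w)                 # equation 2.30
    sigma = max(0.5, sigma * 0.999)                   # slowly shrink the neighbourhood

# Topological ordering: units that are adjacent in the map end up with
# prototypes that are closer to each other than prototype pairs on average.
gaps = np.linalg.norm(np.diff(w, axis=0), axis=1)
dists = np.linalg.norm(w[:, None, :] - w[None, :, :], axis=-1)
d_all = dists[np.triu_indices(n_units, 1)].mean()
assert gaps.mean() < d_all
```

The final assertion is a simple numerical check of the topological ordering discussed in the text: after training, neighbouring units have more similar prototype vectors than units far apart.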
Such a behaviour is desired if the goal is to keep the distortion (equation 2.28) low. The density of parameter vectors is, however, not strictly proportional to the input signal probability (Ritter, 1991), which would minimize the distortion.

Higher-level competitive learning

Competitive learning can also be used on a higher level in a more complex learning system. The function of the whole system is not necessarily based on unsupervised learning; it can be trained using supervised or reinforcement learning. But the system can be divided into subsystems that specialize on different parts of the decision space. The subsystem that handles a certain part of the decision space best will gain control over that part. An example is the adaptive mixtures of local experts by Jacobs et al. (1991). They use a system with several local experts and a gating network that selects among the outputs of the local experts. The whole system uses supervised learning, but the gating network causes the local experts to compete and therefore to try to take responsibility for different parts of the input space.

2.5.3 Mutual information based learning

The third principle of unsupervised learning is based on the concept of mutual information. Mutual information is gaining increased attention in the signal processing community as well as among learning theorists and neural network researchers. The theory, however, dates back to 1948, when Shannon presented his classic foundations of information theory (Shannon, 1948).

A piece of information theory

Consider a discrete random variable x:

x ∈ {x_i},  i ∈ {1, 2, …, N}.  (2.31)

(There is, in practice, no limitation in x being discrete, since all measurements have finite precision.) Let P(x_k) be the probability of x = x_k for a randomly chosen x.
The information content in the vector (or symbol) x_k is defined as

I(x_k) = log( 1 / P(x_k) ) = − log P(x_k).  (2.32)

If base 2 is used for the logarithm, the information is measured in bits. The definition of information has some appealing properties. First, the information is 0 if P(x_k) = 1; if the receiver of a message knows that the message will be x_k, he does not get any information when he receives the message. Secondly, the information is always positive; it is not possible to lose information by receiving a message. Finally, the information is additive, i.e. the information in two independent symbols is the sum of the information in each symbol:

I(x_i, x_j) = − log(P(x_i, x_j)) = − log(P(x_i)P(x_j)) = − log P(x_i) − log P(x_j) = I(x_i) + I(x_j)  (2.33)

if x_i and x_j are statistically independent. The information measure considers each instance of the stochastic variable x, but it does not say anything about the stochastic variable itself. This can be accomplished by calculating the average information of the stochastic variable:

H(x) = ∑_{i=1}^{N} P(x_i) I(x_i) = − ∑_{i=1}^{N} P(x_i) log P(x_i).  (2.34)

H(x) is called the entropy of x and is a measure of the uncertainty about x. Now we introduce a second discrete random variable y which, for example, can be an output signal from a system with x as input. The conditional entropy (Shannon, 1948) of x given y is

H(x|y) = H(x, y) − H(y).  (2.35)

The conditional entropy is a measure of the average information in x given that y is known. In other words, it is the remaining uncertainty of x after observing y. The average mutual information⁴ I(x; y) between x and y is defined as the average information about x gained when observing y:

I(x; y) = H(x) − H(x|y).  (2.36)

The mutual information can be interpreted as the difference between the uncertainty of x and the remaining uncertainty of x after observing y.
In other words, it is the reduction in uncertainty of x gained by observing y. Inserting equation 2.35 into equation 2.36 gives

I(x; y) = H(x) + H(y) − H(x, y) = I(y; x),  (2.37)

which shows that the mutual information is symmetric. Now let x be a continuous random variable. Then the differential entropy h(x) is defined as (Shannon, 1948)

h(x) = − ∫_{R^N} p(x) log p(x) dx,  (2.38)

where p(x) is the probability density function of x. The integral is over all dimensions in x. The average information in a continuous variable would of course be infinite, since there is an infinite number of possible outcomes. This can be seen if the discrete entropy definition (eq. 2.34) is calculated in the limit as x approaches a continuous variable:

H(x) = − lim_{δx→0} ∑_{i=−∞}^{∞} p(x_i)δx log( p(x_i)δx ) = h(x) − lim_{δx→0} log δx,  (2.39)

where the last term approaches infinity as δx approaches zero (Haykin, 1994). But since mutual information considers the difference in entropy, the infinite term vanishes, and continuous variables can be used to simplify the calculations. The mutual information between the continuous random variables x and y is then

I(x; y) = h(x) + h(y) − h(x, y) = ∫_{R^N} ∫_{R^M} p(x, y) log( p(x, y) / (p(x) p(y)) ) dx dy,  (2.40)

where N and M are the dimensionalities of x and y respectively. Consider the special case of Gaussian distributed variables. The differential entropy of an N-dimensional Gaussian variable z is

h(z) = ½ log( (2πe)^N |C| ),  (2.41)

where C is the covariance matrix of z (see proof B.1.1 on page 153). This means that the mutual information between two N-dimensional Gaussian variables is

I(x; y) = ½ log( |C_xx| |C_yy| / |C| ),  where  C = [ C_xx  C_xy ; C_yx  C_yy ].  (2.42)

C_xx and C_yy are the within-set covariance matrices and C_xy = C_yxᵀ is the between-sets covariance matrix. For more details on information theory, see for example Gray (1990).

⁴ Shannon (1948) originally used the term rate of transmission. The term mutual information was introduced later.
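The discrete definitions above (equations 2.34–2.37) are easy to check numerically. The small example below, with an arbitrarily chosen joint distribution, computes the entropies and verifies that equations 2.36 and 2.37 give the same mutual information:

```python
import numpy as np

# Joint probabilities P(x, y) for two binary variables (rows: x, columns: y).
P = np.array([[0.4, 0.1],
              [0.1, 0.4]])

def H(p):
    """Entropy (equation 2.34) in bits."""
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

Px, Py = P.sum(axis=1), P.sum(axis=0)       # marginal distributions
Hx, Hy, Hxy = H(Px), H(Py), H(P.ravel())

Hx_given_y = Hxy - Hy                       # conditional entropy (equation 2.35)
I = Hx - Hx_given_y                         # mutual information (equation 2.36)

print(I)                                    # ≈ 0.278 bits
print(Hx + Hy - Hxy)                        # the same value (equation 2.37)
```

Note that I(x; y) here is positive because x and y are dependent; for a product distribution P(x, y) = P(x)P(y) it would be zero.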
Mutual information based learning

Linsker (1988) showed that Hebbian learning gives maximum mutual information between the input and the output in a simple case with a linear unit with noise added to the output. In a more advanced model with several units, he showed that there is a trade-off between keeping the output signals uncorrelated and suppressing the noise. Uncorrelated output signals give more information (higher entropy) at the output, but redundancy can help to suppress the noise. The principle of maximizing the information transferred from the input to the output is called the infomax principle by Linsker (1988). Linsker has also proposed a method, based on maximum mutual information, for generating a topologically ordered feature map (Linsker, 1989). The map is similar to the SOFM mentioned in section 2.5.2 (page 27), but in contrast to the SOFM, Linsker’s learning rule causes the distribution of input units to be proportional to the input signal probability density.

[Figure 2.3: The difference between infomax (a), which maximizes I(x : y) between the input x and the output y, and Imax (b), which maximizes I(y1 : y2) between the outputs y1 and y2 of two units with inputs x1 and x2.]

Bell and Sejnowski (1995) have used mutual information maximization to perform blind separation of mixed unknown signals and blind deconvolution of a signal convolved with an unknown filter. Actually, they maximize the entropy of the output signal y rather than explicitly maximizing the mutual information between x and y. The results are, however, the same if there is independent noise in the output but no known noise in the input⁵. To see this, consider a system where y = f(x) + η, where η is an independent noise signal. The mutual information between x and y is then

I(x; y) = h(y) − h(y|x) = h(y) − h(η),  (2.43)

where h(η) is independent of the parameters of f. Becker and Hinton (1992) have used mutual information maximization in another way than Linsker and Bell and Sejnowski.

⁵ “No known noise” means that the input cannot be divided into a signal part x and a noise part η. The noise is an indistinguishable part of the input signal x.
Instead of maximizing the mutual information between the input and the output, they maximize the mutual information between the outputs of different units, see figure 2.3. They call this principle Imax and have used it to estimate disparity in random-dot stereograms (Becker and Hinton, 1992) and to detect depth discontinuities in stereo images (Becker and Hinton, 1993). A good overview of Imax is given by Becker (1996). Other mutual information based methods for unsupervised learning include Barlow’s minimum entropy coding, which aims at minimizing the statistical dependence between the output signals (Barlow, 1989; Barlow et al., 1989; Földiák, 1990), and the Gmax algorithm (Pearlmutter and Hinton, 1986), which tries to detect statistically dependent features in the input signal.

The relation between mutual information and correlation

There is a clear relation between mutual information and correlation for Gaussian distributed variables. Consider two one-dimensional random variables x and y. Equations 2.42 and 2.25 then give

I(x; y) = ½ log( σ_x² σ_y² / (σ_x² σ_y² − σ_xy²) ) = ½ log( 1 / (1 − ρ_xy²) ),  (2.44)

where σ_x² and σ_y² are the variances of x and y respectively, σ_xy is the covariance between x and y, and ρ_xy is the correlation between x and y. The extension of this relation to multidimensional variables is discussed in chapter 4. This relationship means that for a single linear unit with Gaussian distributed variables, the mutual information between the input and the output, i.e. the amount of transferred information, is maximized when the correlation between the input and the output is maximized.

2.6 Comparisons between the three learning methods

The difference between supervised learning, reinforcement learning and unsupervised learning may seem very fundamental at first.
But sometimes the distinction between them is not so clear, and the classification of a learning method can depend on the view of the observer. As we have seen in section 2.4, reinforcement learning can be implemented as supervised learning of the reward function. The output is then chosen as the one giving the maximum value of the approximation of the reward function given the present input. Another way of implementing reinforcement learning is to use the output of the system as the desired output in a supervised learning algorithm and weight the update step with the reward (Williams, 1988). Furthermore, supervised learning can emerge as a special case of reinforcement learning where the system is forced to give the desired output while receiving maximum reward. Also, a task for a supervised learning system can always be reformulated to fit a reinforcement learning system simply by mapping the error vectors to scalars, for example as a function of the norm of the error vectors. Unsupervised learning, too, can sometimes be formulated as a supervised learning task. Consider, for example, the PCA algorithms (section 2.5.1) that find the maximal eigenvectors of the distribution of x. For a single parameter vector w, the problem can be formulated as minimizing the difference between the signal x and the output y = xᵀŵ ŵ, i.e.

½ E[ ‖x − xᵀŵ ŵ‖² ] = ½ ( E[xᵀx] − ŵᵀE[xxᵀ]ŵ ) = ½ ( tr(C) − ŵᵀCŵ ) = ½ ( ∑_i λ_i − ŵᵀCŵ ),  (2.45)

where C is the covariance matrix of x (assuming x̄ = 0) and λ_i are the eigenvalues of C. Obviously, the best choice of w is a maximal eigenvector of C. The output is a reconstruction of x, and the desired output is the same as the input. Another example is the methods described in chapter 4 and by van der Burg (1988).
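Equation 2.45 can be verified numerically: for a unit vector ŵ, the reconstruction error equals ½(tr(C) − ŵᵀCŵ) and is therefore smallest for a maximal eigenvector of C. The data distribution in this sketch is an arbitrary illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(3)

# Zero-mean data; the variances along the axes are 4, 1 and 0.25.
X = rng.standard_normal((20000, 3)) @ np.diag([2.0, 1.0, 0.5])
C = X.T @ X / len(X)                       # sample covariance matrix

eigvals, eigvecs = np.linalg.eigh(C)       # eigenvalues in ascending order

def cost(w_hat):
    """(1/2) E[ ||x - (x.w_hat) w_hat||^2 ], estimated over the samples."""
    y = X @ w_hat
    return 0.5 * float(np.mean(np.sum((X - np.outer(y, w_hat)) ** 2, axis=1)))

w_max = eigvecs[:, -1]                     # a maximal eigenvector of C
# Equation 2.45: the cost equals (1/2)(tr(C) - w_hat' C w_hat) ...
assert abs(cost(w_max) - 0.5 * (np.trace(C) - w_max @ C @ w_max)) < 1e-6
# ... and is minimized by the maximal eigenvector:
assert cost(w_max) < cost(eigvecs[:, 0])
assert cost(w_max) < cost(eigvecs[:, 1])
```

The first assertion checks the algebraic identity in equation 2.45 sample by sample; the others confirm that the "supervised" reconstruction error is indeed minimized by the first principal component.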
Finally, there is a similarity between all three learning principles in that they all generally try to optimize a scalar measure of performance, for example mean square error, accumulated reward, variance, or mutual information. A good example illustrating how similar these three methods can be is the prediction matrix memory in section 3.3.1.

2.7 Two important problems

There are some important fundamental problems in learning systems. One, called perceptual aliasing, concerns the consistency of the internal representation of external states. Another, the credit assignment problem, concerns how the feedback should be distributed within the system during learning. These two problems are discussed in this section. A third important problem, how to represent the information in a learning system, is discussed in chapter 3.

2.7.1 Perceptual aliasing

Consider a learning system that perceives the external world through a sensory subsystem and represents the set of external states S_E by an internal state representation set S_I. This set can, however, rarely be identical to the real external world state set S_E. To assume a representation that completely describes the external world in terms of objects, their features and relationships is unrealistic even for relatively simple problem settings. Furthermore, the internal state is inevitably limited by the sensor system, which leads to a many-to-many mapping between the internal and external states. That is, a state s_e ∈ S_E in the external world can map into several internal states and, what is worse, an internal state s_i ∈ S_I can represent multiple external world states. This phenomenon has been termed perceptual aliasing (Whitehead and Ballard, 1990a). Figure 2.4 illustrates two cases of perceptual aliasing. One case is when two external states s_e^1 and s_e^2 map into the same internal state s_i^1.
[Figure 2.4: Two cases of perceptual aliasing. Two external states s_e^1 and s_e^2 are mapped into the same internal state s_i^1, and one external state s_e^3 is mapped into two internal states s_i^2 and s_i^3.]

An example is when two different objects appear identical to the system. This is illustrated in view 1 in figure 2.5. The other case is when one external state s_e^3 is represented by two internal states s_i^2 and s_i^3. This happens, for instance, in a system consisting of several local adaptive models, if two or more models happen to represent the same solution to the same part of the problem. Perceptual aliasing may cause the system to confound different external states that have the same internal state representation. This type of problem can cause a response-generating system to make wrong decisions. For example, let the internal state s_i represent the external states s_e^a and s_e^b, and let the system generate an action a. The expected reward for the decision (s_i, a) to generate the action a given the state s_i can now be estimated by averaging the rewards for that decision accumulated over time. If s_e^a and s_e^b occur approximately equally often and the actual accumulated reward for (s_e^a, a) is greater than the accumulated reward for (s_e^b, a), the expected reward will be underestimated for (s_e^a, a) and overestimated for (s_e^b, a), leading to a non-optimal decision policy. There are cases when the phenomenon is a feature, however. This happens if all decisions made by the system are consistent. The reward for the decision (s_i, a) then equals the reward for all corresponding actual decisions (s_e^k, a), where k is an index for this set of decisions.
If the mapping between the external and internal worlds is such that all decisions are consistent, it is possible to collapse a large actual state space into a small one, where situations that are invariant to the task at hand are mapped onto one single situation in the representation space. For a system operating in a large decision space, such a strategy is in fact necessary in order to reduce the number of different states. The goal is then to find a representation of the decision space such that consistent decisions can be found. The simplest example of such deliberate perceptual aliasing is quantization. If the quantization is properly designed, the decisions will be consistent within each quantized state.

[Figure 2.5: Avoiding perceptual aliasing by observing the environment from another direction.]

Whitehead and Ballard (1990b) have presented a solution to the problem of perceptual aliasing for a restricted class of learning situations. The basic idea is to detect inconsistent decisions by monitoring the estimated reward error, since the error will oscillate for inconsistent decisions, as discussed above. When an inconsistent decision is detected, the system is guided (e.g. by changing its direction of view) to another internal state uniquely representing the desired external state. In this way, more actions will produce consistent decisions (see figure 2.5). The guidance mechanisms are not learned by the system. This is noted by Whitehead, who admits that a dilemma is left unresolved: “In order for the system to learn to solve a task, it must accurately represent the world with respect to the task. However, in order for the system to learn an accurate representation, it must know how to solve the task.” The issue of information representation is further discussed in chapter 3.
2.7.2 Credit assignment

In all complex control systems, there probably exists some uncertainty about how to distribute credit (or blame) for the control actions taken. This uncertainty is called the credit assignment problem (Minsky, 1961, 1963). Consider, for example, a political system. Is it the trade policy or the financial policy that deserves credit for the increasing export? We may call this a structural credit assignment problem. Is it the current government or the previous one that deserves credit or blame for the economic situation? This is a temporal credit assignment problem. Is it the management or the staff that should be given credit for the financial result of a company? This is what we may call a hierarchical credit assignment problem. These three types of credit assignment problems are also encountered in the type of control systems considered here, i.e. learning systems. The structural credit assignment problem occurs, for instance, in a neural network when deciding which weights to alter in order to achieve improved performance. In supervised learning, the structural credit assignment problem can be handled by using back-propagation (Rumelhart et al., 1986), for instance. The problem becomes more complicated in reinforcement learning, where only a scalar feedback is available. In section 3.4, a description is given of how the structural credit assignment problem can be handled by the use of local adaptive models. The temporal credit assignment problem occurs when a system acts in a dynamic environment and a sequence of actions is performed. The problem is to decide which of the actions taken deserves credit for the result. Obviously, it is not certain that the final action taken deserves all the credit or blame. (For example, consider the situation when the losing team in a football game scores a goal during the last seconds of the game.
It would not be clever to blame the person who scored that goal for the loss of the game.) The problem becomes especially complicated in reinforcement learning if the reward occurs infrequently. The temporal credit assignment problem is thoroughly investigated by Sutton (1984). Finally, the hierarchical credit assignment problem can occur in a system consisting of several levels. Consider, for example, the adaptive mixtures of local experts (Jacobs et al., 1991). That system consists of two levels. On the lower level, there are several subsystems that specialize on different parts of the input space. On the top level, there is a supervisor that selects the proper subsystem for a certain input. If the system makes a bad decision, it can be difficult to decide whether it was the top level that selected the wrong subsystem, or whether the top level made a correct choice but the subsystem that generated the response made a mistake. This problem can of course be regarded as a type of structural credit assignment problem, but to emphasize the difference we call it a hierarchical credit assignment problem. Once the hierarchical credit assignment problem is solved and it is clear on what level the mistake was made, the structural credit assignment problem can be dealt with to alter the behaviour on that level.

Chapter 3
Information representation

A central issue in the design of learning systems is the representation of information in the system. The algorithms treated in this work can be seen as signal processing systems, in contrast to AI or expert systems, which have symbolic representations¹. We may refer to the representation used in the signal processing systems as a continuous representation, while the symbolic approach can be said to use a string representation. Examples of the latter are the Lion Algorithm (Whitehead and Ballard, 1990a), the Reinforcement Learning Classifier Systems (Smith and Goldberg, 1990) and the MENACE example in section 2.4.
The genetic algorithms described in section 2.4.3 are perhaps the most obvious examples of string representation in biological reinforcement learning systems. The main difference between the two approaches is that a continuous representation has an implicit metric, i.e. there is a continuum of states and there exist meaningful interpolations between different states. One can say that two states are more or less similar. Interpolations are important in a learning system since they make it possible for the system to make decisions in situations never experienced before. This is often referred to as generalization. In a string representation there is no implicit metric, i.e. there is no unambiguous way to tell which of two strings is more similar to a third string. There are, however, also advantages with string representations. Today’s computers, for example, are designed to work with string representations and have difficulties handling continuous information in an efficient way. A string representation also makes it easy to include a priori knowledge in terms of explicit rules. An approach that can be seen as a mix of symbolic and continuous representation is fuzzy logic (Zadeh, 1968, 1988). The symbolic expressions in fuzzy logic include imprecise statements like “many”, “close to”, “usually”, etc. This means that statements need not be true or false; they can be somewhere in between. This introduces a kind of metric, and interpolation becomes possible (Zadeh, 1988). Lee and Berenji (1989) describe a rule-based fuzzy controller using reinforcement learning that solves the pole balancing problem.

¹ By “symbolic”, a more abstract representation is referred to than just a digitization of the signal; a digital signal processing system is still a signal processing system.
Ballard (1990) suggests that it is unreasonable to suppose that peripheral motor and sensory activity are correlated in a meaningful way. Instead, it is likely that abstract sensory and motor representations are built and related to each other. Also, combined sensory and motor information must be represented and used in the generation of new motor activity. This implies a learning hierarchy and that learning occurs on different temporal scales (Granlund, 1978, 1988; Granlund and Knutsson, 1982, 1983, 1990). Hierarchical learning system designs have been proposed by several other researchers (e.g. Jordan and Jacobs, 1994). Both approaches (signal and symbolic) described on the preceding page are probably important, but on different levels in hierarchical learning systems. On a low level, the continuous representation is probably preferable, since signal processing techniques have the potential of being faster than symbolic reasoning, as they are easier to implement with analogue techniques. On a low level, interpolations are meaningful and desirable. In a simple control task, for instance, consider two similar² stimuli s_1 and s_2 which have the optimal responses r_1 and r_2 respectively. For a novel stimulus s_3 located between s_1 and s_2, the response r_3 could, with large probability, be assumed to lie between r_1 and r_2. On a higher level, on the other hand, a more symbolic representation may be needed to facilitate abstract reasoning and planning. Here, the processing speed is not as crucial, and interpolation may not even be desirable. Consider, for instance, the task of passing a tree. On a low level, the motor actions are continuous and meaningful to interpolate, and they must be generated relatively fast. The higher-level decision on which side of the tree to pass is, however, symbolic. Obviously, it is not meaningful to interpolate the two possible alternatives of “walking to the right” and “walking to the left”.
Also, there is more time to make this decision than to generate the motor actions needed for walking. The choice of representation can be crucial for the ability to learn. Geman et al. (1992) argue that "the fundamental challenges in neural modelling are about representation rather than learning per se." Furthermore, Hertz et al. (1991) present a simple but illustrative example to emphasize the importance of the representation of the input to the system. Two tasks are considered: the first one is to decide whether or not the input is an odd number; the second is to decide if the input has an odd number of prime factors. If the input has a binary representation, the first task is extremely simple: the system just has to look at the least significant bit. The second task, however, is very difficult. If the base is changed to 3, for instance, the first task will be much harder. And if the input is represented by its prime factors, the second task will be easier. Hertz et al. (1991) also prove an obvious (and, as they say, silly) theorem: "learning will always succeed, given the right preprocessor." In the discussion above, representation of two kinds of information is actually treated: the information entering the system as input signals (signal representation) and the information in the system about how to behave, i.e. knowledge learned by the system (model representation). The representations of these two kinds of information are, however, closely related to each other. As we will see, a careful choice of input signal representation can allow for a very simple representation of knowledge. In the following section, a special type of signal representation called the channel representation is presented.
It is a representation that is biologically inspired and which has several computational advantages. The later sections will deal more with model representations. Probably the most well-known class of model representations among learning systems, neural networks, is presented in section 3.2. They can be seen as global non-linear models. Section 3.3 shows how the channel representation makes it possible to use a simple linear model. Section 3.4 argues that low-dimensional linear models are sufficient if they are local enough, and the adaptive distribution of such models is briefly discussed in section 3.5. The chapter ends with simple examples of reinforcement learning systems solving the same problem but with different representations.

3.1 The channel representation

As has been discussed above, the internal representation of information may play a decisive role for the performance of learning systems. The representation that is intuitively most obvious in a certain situation, for example a scalar t for temperature or a three-dimensional vector p = (x y z)T for a position in space, is, however, in some cases not a very good way to represent information. For example, consider an orientation in R2, which can be represented by an angle ϕ ∈ [−π, π] relative to a fixed orientation, for example the x-axis. While this may appear as a very natural representation of orientation, it is in fact not a very good one since it has a discontinuity at π, which means that an orientation average cannot be consistently defined (Knutsson, 1989). Another, perhaps more natural, way of representing information is the channel representation (Nordberg et al., 1994; Granlund, 1997). In this representation, a set of channels is used where each channel is sensitive to some specific feature value in the signal, for example a certain temperature ti or a certain position pi.
In the example above, the orientation in R2 could be represented by a set of channels evenly spread out on the unit circle, as proposed by Granlund (1978). If three channels of the shape

c_k = cos²(3(ϕ − p_k)/4),   (3.1)

where p1 = 2π/3, p2 = 0 and p3 = −2π/3, are used (Knutsson, 1982), the orientation can be represented continuously by the channel vector c = (c1 c2 c3)T, which has a constant norm for all orientations. The reason to call this a more natural representation than, for instance, the angle ϕ is that the channel representation is frequently used in biological systems, where each nerve cell responds strongly to a specific feature value. One example of this is the orientation sensitive cells in the primary visual cortex (Hubel and Wiesel, 1959; Hubel, 1988). This representation is called value encoding by Ballard (1987), who contrasts it with variable encoding, where the activity is monotonically increasing with some parameter. Theoretically, the channels can be designed so that there is one channel for each feature value that can occur. A function of these feature values would then be implemented simply as a look-up table. In practice, however, the range of feature values is often continuous (or at least quantized finely enough to be considered continuous). Each channel can be seen as the response of a filter that is tuned to some specific feature value. The coding is then designed so that the channel has its maximum value (for example one) when the feature and the filter are exactly tuned to each other, and decreases to zero in a smooth way as the feature and the filter become less similar. This is similar to the magnitude representation proposed by Granlund (1989). The channel representation increases the number of dimensions in the representation. It should, however, be noted that an increase in the dimensionality does not have to lead to increased complexity of the learning problem.
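As a small numerical sketch of such an orientation coding, the three channels of equation 3.1 can be evaluated for arbitrary angles. The truncation to zero outside |ϕ − p_k| < 2π/3 and the function name are illustrative assumptions here, not prescribed by the text:

```python
import math

# Channel centres from equation 3.1: p1 = 2*pi/3, p2 = 0, p3 = -2*pi/3.
CENTRES = [2 * math.pi / 3, 0.0, -2 * math.pi / 3]

def orientation_channels(phi):
    """Encode an angle phi into the three channels c_k = cos^2(3(phi - p_k)/4),
    taken as zero outside |phi - p_k| < 2*pi/3 (where the cosine reaches zero)."""
    c = []
    for p in CENTRES:
        # Angular distance wrapped to [-pi, pi].
        d = (phi - p + math.pi) % (2 * math.pi) - math.pi
        c.append(math.cos(0.75 * d) ** 2 if abs(d) < 2 * math.pi / 3 else 0.0)
    return c
```

With this channel width, every angle activates at most two neighbouring channels, and since neighbouring arguments differ by exactly π/2, the active pair always satisfies cos²(u) + cos²(u − π/2) = 1, so the channel values sum to one for every orientation.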
A great advantage of the channel representation is that it allows for simple processing structures. To see this, consider any continuous function y = f(x). If x is represented by a sufficiently large number of channels c_k of a suitable form, the output y can simply be calculated as a weighted sum of the input channels, y = wᵀc, however complicated the function f may be. This implies that by using a channel representation, linear operations can be used to a great extent; this fact is used further in this chapter. It is not obvious how to choose the shape of the channels. Consider, for example, the coding of a variable x into channels. According to the description above, each channel is positive and has its maximum for one specific value of x and it decreases smoothly to zero away from this maximum. In addition, to enable representation of all values of x in an interval, there must be overlapping channels on this interval.

[Figure 3.1: A set of cos² channels. Only three channels are activated simultaneously. The sum of the squared channel outputs c_{k−1}, c_k and c_{k+1} is drawn with a dotted line.]

It is also convenient if the norm of the channel vector is constant so that the feature value is only represented by the orientation of the channel vector. This enables the use of the scalar product for calculating the similarity between values. It also makes it possible to use the norm of the channel vector to represent some other entity related to the measurement, for instance the energy or the certainty of the measurement. One channel form that fulfils the requirements above is:

c_k(x) = cos²(π(x − k)/3)   if |x − k| < 3/2,
         0                  otherwise,   (3.2)

(see figure 3.1). This set of channels has a constant norm (see proof B.2.1 on page 154). It also has a constant square sum of its first derivatives (see proof B.2.2 on page 155) (Knutsson, 1982, 1985). This means that a change ∆x in x always gives a change ∆c of the same magnitude in c for any x.
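The channel form in equation 3.2 is easy to check numerically. The following sketch (function name and channel count chosen here for illustration) encodes a scalar into cos² channels and verifies the constant-norm property that proof B.2.1 establishes (the squared norm is 9/8 wherever the channel set covers x):

```python
import math

def channel_encode(x, num_channels):
    """Encode a scalar x into cos^2 channels with unit spacing (equation 3.2):
    c_k(x) = cos^2(pi*(x - k)/3) for |x - k| < 3/2, and 0 otherwise."""
    c = []
    for k in range(num_channels):
        d = x - k
        c.append(math.cos(math.pi * d / 3) ** 2 if abs(d) < 1.5 else 0.0)
    return c

# For x well inside the channel range, exactly three channels are active
# and the squared norm of the channel vector is constant (9/8).
for x in [2.0, 2.3, 4.75, 7.5]:
    c = channel_encode(x, 11)
    assert abs(sum(v * v for v in c) - 9 / 8) < 1e-12
```

Since the norm is constant, only the direction of the channel vector carries the value of x, which is what makes the scalar product a meaningful similarity measure between encoded values.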
Of course, not only scalars can be coded into vectors with constant norm. Any vector v in a vector space of (N − 1) dimensions can be transformed into the orientation of a unit-length vector in an N-dimensional space. This was used, for example, by Denoeux and Lengellé (1993) in order to keep the norm of the input vectors constant and equal to one while preserving all the information. By using this new input representation, a scalar product could be used for calculating the similarity between the input vectors and a set of prototype vectors. The channel vectors described above have non-zero activity only in a small number of dimensions at a time; the channels in all other dimensions are zero. The number of simultaneously active channels is called the local dimensionality. In the example in figure 3.1, the local dimensionality is three. This means that the vector moves along a curve as in figure 3.2 (left) as x changes.

[Figure 3.2: Left: The curve along which a channel vector can move in a subspace spanned by three neighbouring channels. The broken part of the curve illustrates the proceeding of the vector into other dimensions. Right: The possible channel vectors viewed in a subspace spanned by three distant non-overlapping channels.]

If we look at channels far apart, only one of these channels is active at a time (figure 3.2, right); the activity is local. We call this type of channel vector a pure channel vector. The pure channel vector can be seen as an extreme of sparse distributed coding (Field, 1994). This is a coding that represents data with a minimum number of active units, in contrast to compact coding, which represents data with a minimum number of units. In general, the input to a system cannot be a pure channel vector. Consider, for example, a system that uses visual input, i.e. images.
It is obvious that the dimensionality of the space of pure channel vectors that can represent all images would be far too large to be of practical interest. The input should rather consist of many sets of channels where each set measures a local property in the image, for example local orientation. Each set can be a pure channel vector, but the total input vector, consisting of several concatenated pure channel vectors, will not only have local activity. We call this type of vector, which consists of many sets of channels, a mixed channel vector. The use of mixed channel vectors is not only motivated by limited processing capacity. Consider, for example, the representation of a two-dimensional variable x = (x1 x2)T. We may represent this variable with a pure channel vector by distributing overlapping channels, each sensitive to a different combination of x1 and x2, over the X-plane as in figure 3.3 (left). Another way is to represent x with a mixed channel vector by using two sets of channels as in figure 3.3 (right).

[Figure 3.3: Left: Representation of a two-dimensional variable with one set of channels that constitute a pure channel vector. Right: Representation of the same variable with two sets of channels that together form a mixed channel vector.]

Here, each set is only sensitive to one of the two parameters x1 and x2 and it does not depend on the other parameter at all; the channel vector c1 on the x1-axis is said to be invariant with respect to x2. Invariance can be seen as a deliberate perceptual aliasing as discussed in section 2.7.1. If x1 and x2 represent different properties of x, for instance colour and size, the invariance can be a very useful feature. It makes it possible to observe one property independently of the others by looking at a subset of the channels. Note, however, that this does not mean that all multidimensional variables should be represented by mixed channel vectors.
If, for example, (x1 x2)T in figure 3.3 represents the two-dimensional position of a physical object, it does not seem useful to see the x1 and x2 positions as two different properties. In this case, the pure channel vector (left) might be a proper representation. The use of mixed channel vectors offers another advantage compared to using the original variables, namely the simultaneous representation of properties which belong to different objects. Consider a one-dimensional variable x representing the position of an object along a line and compare this with a channel vector c representing the same thing. Now, if two objects occur at different positions, a mixed channel vector allows for the positions of both objects to be represented. This is obviously not possible when using the single variable x. Note that the mixed channel vector discussed here differs from the one described previously, which consists of two or more concatenated pure channel vectors. In that case, the mixed channel vector represents several features and one instance of each feature. In the case of representing two or more positions, the mixed channel vector represents several instances of the same feature, i.e. multiple events. Both representations are, however, mixed channel vectors in the sense that they can have simultaneous activity on channels far apart, as opposed to pure channel vectors.

3.2 Neural networks

Neural networks are perhaps the most popular and well-known implementations of artificial learning systems. The concept is so popular that it is often used synonymously with machine learning, which sometimes can be a bit misleading. There is no unanimous definition of neural networks, but they are usually characterized by a large number of massively connected, relatively simple processing units.

[Figure 3.4: The basic neuron. The output y is a non-linear function f of a weighted sum of the inputs x.]
Learning capabilities are often understood even if they are not explicit. One could of course imagine a hard-wired neural network incapable of learning. Neural networks can be seen as global parameterized non-linear models. The processing units in a neural network are often called neurons (hence the name neural network) since they were originally designed as models of the nerve cells (neurons) in the brain. In figure 3.4, an artificial neuron is illustrated. This basic model of an artificial neuron was proposed by McCulloch and Pitts (1943), where the non-linear function f was a Heaviside (unit step) function, i.e.

f(x) = 0     if x < 0,
       1/2   if x = 0,
       1     if x > 0.   (3.3)

An example of a neural network is the two-layer perceptron illustrated in figure 3.5, which consists of neurons like the one described above connected in a feed-forward manner.

[Figure 3.5: A two-layer perceptron with a two-dimensional input and a three-dimensional output.]

The neural network is a parameterized model and the parameters are often called weights. Rosenblatt (1962) presented a supervised learning algorithm for a single-layer perceptron. Later, however, Minsky and Papert (1969) showed that a single-layer perceptron failed to solve even some simple problems, for example the Boolean exclusive-or function. While it was known that a three-layer perceptron can represent any continuous function, Minsky and Papert doubted that a learning method for a multi-layer perceptron would be possible to find. This finding almost extinguished the interest in neural networks for nearly two decades, until learning methods for multi-layer perceptrons were developed in the 1980s. The most well-known method is back-propagation, presented in a Ph.D. thesis by Werbos (1974) and later presented by Rumelhart et al. (1986).
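The exclusive-or limitation is easy to make concrete: no single neuron of the kind in figure 3.4 can compute XOR, but two layers of them can. A minimal sketch with hand-picked weights (the weights and function names are illustrative, not from the text):

```python
def heaviside(x):
    """The unit step function of equation 3.3 (value 1/2 at x = 0)."""
    return 0.0 if x < 0 else (0.5 if x == 0 else 1.0)

def neuron(inputs, weights, bias):
    """A McCulloch-Pitts neuron: step function of a weighted sum."""
    return heaviside(sum(w * x for w, x in zip(weights, inputs)) + bias)

def xor_net(x1, x2):
    """Two-layer solution to the exclusive-or problem: one hidden unit
    detects OR, the other detects AND; the output fires for OR but not AND."""
    h_or = neuron([x1, x2], [1, 1], -0.5)
    h_and = neuron([x1, x2], [1, 1], -1.5)
    return neuron([h_or, h_and], [1, -1], -0.5)

for a in (0, 1):
    for b in (0, 1):
        assert xor_net(a, b) == (a ^ b)
```

The hidden layer re-represents the input so that the last layer faces a linearly separable problem, which is exactly the representational point made in this chapter.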
The solution to the problem of how to update a multi-layer perceptron was to replace the Heaviside function (equation 3.3) with a differentiable non-linear function, usually a sigmoid function. Examples of common sigmoid functions are f(x) = tanh(x) and the Fermi function:

f(x) = 1 / (1 + e^(−x)).   (3.4)

The sigmoid function can be seen as a basis function for the internal representation in the network. Another choice of basis function is the radial-basis function (RBF), for example a Gaussian, that is used in the input layer in RBF networks (Broomhead and Lowe, 1988; Moody and Darken, 1989). The RBFs can be seen as a kind of channel representation. The feed-forward design in figure 3.5 is, of course, not the only possible arrangement of neurons in a neural network. It is also possible to have connections from the output back to the input, so called recurrent networks. Two famous examples of recurrent networks are the Hopfield network (Hopfield, 1982) and the Boltzmann machine (Hinton and Sejnowski, 1983, 1986).

3.3 Linear models

While neural networks are non-linear models, it could sometimes be sufficient to use a linear model, especially if the representation of the input to the system is chosen carefully. As mentioned above, the channel representation makes it possible to realize a rather complicated function as a linear function of the input channels. In fact, the RBF networks can be seen as a hidden layer creating a channel representation followed by an output layer implementing a linear model. In this section, a linear model for reinforcement learning called the prediction matrix memory is presented.

3.3.1 The prediction matrix memory

In this subsection, a system that is to learn to produce an output channel vector q as a function of an input channel vector v is described.
The functions considered here are continuous functions of a pure channel vector (see page 42) or functions that are dependent on one property while invariant with respect to the others in a mixed channel vector; in other words, functions that can be realized by letting the output channels be linear combinations of the input channels. We call functions of this type first-order functions. (This concept of order has similarities to the one defined by Minsky and Papert (1969); in their discussion, the inputs are binary vectors, which can of course be seen as mixed channel vectors with non-overlapping channels.) The order can be seen as the number of events in the input vector that must be considered simultaneously in order to define the output. In practice, this means that, for instance, a first-order function does not depend on any relation between different events; a second-order function depends on the relation between no more than two events, and so on. Consider a first-order system which is supplied with an input channel vector v and which generates an output channel vector q. Suppose that v and q are pure channel vectors. If there is a way of defining a scalar r (the reinforcement) for each decision (v, q) (i.e. input-output pair), the function r(v, q) is a second-order function. The tensor space Q ⊗ V that contains the outer products qvᵀ we call the outer product decision space. In this space, the decision (v, q) is one event. Hence, r can be calculated as a first-order function of the outer product qvᵀ. In practice, the system will, of course, handle a finite number of overlapping channels and r will only be an approximation of the reward. But if the reward function is continuous, this approximation can be made arbitrarily good by using a sufficiently large set of channels.

[Figure 3.6: The reward prediction p for a certain stimulus-response pair (v, q) viewed as a projection onto W in Q ⊗ V.]
Learning the reward function

If supervised learning is used, the linear function could be learned by training a weight vector w̃_i for each output channel q_i so that q_i = w̃_iᵀv. This could be done by minimizing some error function, for instance

E = E[‖q − q̃‖²],   (3.5)

where q̃ is the correct output channel vector supplied by the teacher. This means, for the whole system, that a matrix W̃ is trained so that a correct output vector is generated as

q = W̃v = (w̃_1ᵀv  w̃_2ᵀv  …)ᵀ.   (3.6)

In reinforcement learning, however, the correct output is unknown; only a scalar r that is a measure of the performance of the system is known (see section 2.4 on page 12). But the reward is a function of the stimulus and the response, at least if the environment is not completely stochastic. If the system can learn this function, the best response for each stimulus can be found. As described above, the reward function for a first-order system can be approximated by a linear combination of the terms in the outer product qvᵀ. This approximation can be used as a prediction p of the reward and is calculated as

p = ⟨W | qvᵀ⟩;   (3.7)

see figure 3.6. The matrix W is therefore called a prediction matrix memory. The reward function can be learned by modifying W in the same manner as in supervised learning, but here with the aim to minimize the error function

E = E[|r − p|²].   (3.8)

Now, let each triple (v, q, r) of stimulus, response, and reward denote an experience. Consider a system that has been subject to a number of experiences. How should a proper response be chosen by the system? The prediction p in equation 3.7 can be rewritten as

p = qᵀWv = ⟨q | Wv⟩.   (3.9)

Due to the channel representation, the actual output is completely determined by the direction of the output vector. Hence, we can regard the norm of q as fixed and try to find an optimal direction of q. The q that gives the highest predicted reward obviously has the same direction as Wv.
Now, if p is a good prediction of the reward r for a certain stimulus v, this choice of q would be the one that gives the highest reward. An obvious choice of the response q is then

q = Wv,   (3.10)

which is the same first-order function as the W̃ suggested for supervised learning in equation 3.6. Since q is a function of the input v, the prediction can be calculated directly from the input. Equation 3.9 together with equation 3.10 gives the prediction as

p = (Wv)ᵀWv = ‖Wv‖².   (3.11)

Now we have a very simple processing structure (essentially a matrix multiplication) that can generate proper responses and predictions of the associated rewards for any first-order function. This structure is similar to the learning matrix or correlation matrix memory described by Steinbuch and Piske (1963) and later by Anderson (1972, 1983) and by Kohonen (1972, 1989). The correlation matrix memory is a kind of linear associative memory that is trained with a generalization of Hebbian learning (Hebb, 1949). An associative memory maps an input vector a to an output vector b, and the correlation matrix memory stores this mapping as a sum of outer products:

M = ∑ baᵀ.   (3.12)

The stored patterns are then retrieved as

b = Ma,   (3.13)

which is equal to equation 3.10. The main difference is that in the method described here, the correlation strength is retrieved and used as a prediction of the reward. Kohonen (1972) has investigated the selectivity and the tolerance with respect to destroyed connections in correlation matrix memories. The training of the matrix W is a very simple algorithm. For a certain experience (v, q, r), the prediction p should, in the optimal case, equal r. This means that the aim is to minimize the error in equation 3.8. The desired weight matrix W′ would yield a prediction

p′ = r = ⟨W′ | qvᵀ⟩.   (3.14)

Since this is a linear problem, it could be tempting to solve it analytically.
This could be done recursively using the recursive least squares (RLS) method (Ljung, 1987). The problem is that RLS involves the estimation and inversion of a p × p matrix, where p = dim(q)dim(v). Since the dimensionalities of q and v are in general high due to the channel representation, RLS is not a very useful tool in this case. Instead, we use stochastic gradient search (see section 2.3.1 on page 10) to find W′. From equations 3.7 and 3.8 we get the error

ε = |r − ⟨W | qvᵀ⟩|²   (3.15)

and the gradient is

∂ε/∂W = −2(r − p)qvᵀ.   (3.16)

To minimize the error, W should be changed a certain amount a in the direction qvᵀ, i.e.

W′ = W + a qvᵀ.   (3.17)

Equation 3.14 now gives that

r = p + a‖q‖²‖v‖²   (3.18)

(see proof B.2.3 on page 156), which gives

a = (r − p) / (‖q‖²‖v‖²).   (3.19)

To perform stochastic gradient search (equation 2.6 on page 11), we change the parameter vector a small step in the negative gradient direction for each iteration. The update rule therefore becomes

W(t + 1) = W(t) + ∆W(t),   (3.20)

where

∆W = α ((r − p) / (‖q‖²‖v‖²)) qvᵀ,   (3.21)

where α is the update factor (0 < α ≤ 1) (see section 2.3.2 on page 11). If the channel representation is chosen so that the norm of the channel vectors is constant and equal to one, this equation is simplified to

∆W = α(r − p)qvᵀ.   (3.22)

Here, the difference between this method and the correlation matrix memory becomes clearer. The learning rule in equation 3.12 corresponds to that in equation 3.22 with α(r − p) = 1. The prediction matrix W in equation 3.22 will converge when r = p, while the correlation matrix M in equation 3.12 would grow for each iteration unless a normalization procedure is used. Here, we can see how reinforcement learning and supervised learning can be combined, as mentioned in section 2.6. By setting r = p + 1 and α = 1 we get the update rule for the correlation matrix memory in equation 3.12, and with r = 1 we get a correlation matrix memory with a converging matrix.
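As a minimal sketch of the update rule in equation 3.22 (with unit-norm channel vectors; the dimensions, rewards and function name are chosen here purely for illustration), one can verify that the prediction converges towards the reward:

```python
def outer_update(W, v, q, r, alpha=0.1):
    """One step of equation 3.22: W <- W + alpha*(r - p)*q*v^T, where
    p = <W | q v^T> = q^T W v (equations 3.7 and 3.9). Returns p."""
    n_out, n_in = len(W), len(W[0])
    p = sum(q[i] * W[i][j] * v[j] for i in range(n_out) for j in range(n_in))
    for i in range(n_out):
        for j in range(n_in):
            W[i][j] += alpha * (r - p) * q[i] * v[j]
    return p

# Two experiences with orthogonal unit channel vectors: the predictions
# for the two stimulus-response pairs should converge to their rewards.
W = [[0.0] * 3 for _ in range(3)]
experiences = [([1, 0, 0], [0, 1, 0], 1.0),   # (v, q, r)
               ([0, 1, 0], [1, 0, 0], 0.5)]
for _ in range(200):
    for v, q, r in experiences:
        outer_update(W, v, q, r)
assert abs(W[1][0] - 1.0) < 1e-3 and abs(W[0][1] - 0.5) < 1e-3
```

With one-hot vectors the inner product ⟨W | qvᵀ⟩ reduces to reading a single matrix entry, which makes the geometric convergence of p towards r (at rate 1 − α per update) easy to see.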
This means that if the correct response is known, it can be learned using supervised learning by forcing the output to the correct response and setting the parameters α = 1 and r = 1 or r = p + 1. When the correct response is not known, the system is left to produce the response and the reinforcement learning algorithm described above can be used.

Relation to Q-learning

The description above of the learning algorithm assumed a reinforcement signal as feedback to the system for each single decision (i.e. stimulus-response pair). This is, however, not necessary. Instead of learning the instantaneous reward function r(x, y), the system can be trained to learn the Q-function Q(x, y) (equation 2.13 on page 19), which can be written as

Q(x(t), y(t)) = r(x(t), y(t)) + γQ(x(t + 1), y(t + 1)),   (3.23)

where γ is a prediction decay factor (0 < γ < 1) that makes the predicted reinforcement decay as the distance from the actual rewarded state increases. Now the right-hand side of equation 3.23 can be used instead of r in equation 3.22 as the desired prediction. This gives

∆W = α(r(t) + γp(t + 1) − p(t)) qvᵀ.   (3.24)

This means that the system can handle dynamic problems with infrequent reinforcement signals by maximizing the long-term reward function. In one sense, this system is better suited for the use of TD-methods than the systems mentioned in section 2.4.1 on page 17, since those have to use separate subsystems to calculate the predicted reinforcement. With the algorithm suggested here, the prediction is calculated by the same system as the response.

3.4 Local linear models

Global linear models (e.g. the prediction matrix memory) can of course not be used for all problems. The number of dimensions required for a pure channel representation would in general be far too high. But a global non-linear model (e.g. a neural network) is in general not a solution.
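The temporal-difference form in equation 3.24 can be sketched on a hypothetical two-state chain with a single action (the chain, the reward placement and all names are invented here for illustration). With one-hot state vectors, the prediction p = qᵀWv reduces to a table lookup, and the predictions propagate backwards from the rewarded state:

```python
GAMMA = 0.9   # prediction decay factor
ALPHA = 0.1   # update factor

# One action and two states; with one-hot channel vectors the prediction
# p = q^T W v is simply the entry W[state].
W = [0.0, 0.0]
for _ in range(500):
    # State 0: no reward; the successor prediction is taken in state 1
    # (equation 3.24 with r(t) = 0).
    W[0] += ALPHA * (0.0 + GAMMA * W[1] - W[0])
    # State 1: reward 1 and the episode ends (no successor prediction).
    W[1] += ALPHA * (1.0 - W[1])

# The learned predictions approximate Q(s1) = 1 and Q(s0) = gamma.
assert abs(W[1] - 1.0) < 1e-3
assert abs(W[0] - GAMMA) < 1e-3
```

The state one step before the reward ends up predicting γ times the reward, which is exactly the decay with distance from the rewarded state that the text describes.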
The number of parameters in a global non-linear model would be far too high to be possible to estimate with a low variance using a reasonable number of samples. The rescue in this situation is that we generally do not need a global model at all. Consider a system with a visual input consisting only of a binary image with 8 × 8 pixels (picture elements), binary meaning here that each pixel can only have two different values, e.g. black or white, which is indeed a limited visual sensor. There are 2⁶⁴ > 10¹⁹ possible different binary 8 × 8 images. If they were displayed with a frame rate of 50 frames per second, it would take about 10 billion years to view them all, a period of time that is about the same as the age of the universe! It is quite obvious that most of the possible events in a high-dimensional space will never occur during the lifetime of a system. In fact, only a very small fraction of the signal space will ever be visited by the signal. Furthermore, the environment that causes the input signals is limited by the dynamics of the outside world, and these dynamics put restrictions on how the input signal can move. This means that the high-dimensional input signal will move on a low-dimensional subspace (Landelius, 1997) and we do not have to search for a global model for the whole signal space (at least if a proper representation is used). The low dimensionality can intuitively be understood if we consider a signal consisting of N frequency components. Such a signal can span at most a 2N-dimensional space since each frequency component defines an ellipse and hence spans at most a two-dimensional plane (Johansson, 1997) (see proof B.2.4 on page 156). In the case of images, this is expressed in the assumption of local one-dimensionality (Granlund, 1978; Granlund and Knutsson, 1995): "The content within a window, measured at a sufficiently small bandwidth, will as a rule have a single dominant component."
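The arithmetic behind the 8 × 8 example is easy to check:

```python
SECONDS_PER_YEAR = 365 * 24 * 3600
images = 2 ** 64                        # distinct binary 8x8 images
years = images / 50 / SECONDS_PER_YEAR  # viewed at 50 frames per second
# Roughly 1.2e10 years, on the order of the age of the universe.
assert 1.0e10 < years < 1.3e10
```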
By this reasoning, it is sufficient to have a model or a set of models that covers the manifold where the signal exists (Granlund and Knutsson, 1990). If the signal manifold is continuous in space and time (which is reasonable due to the dynamics of the outside world), the low-dimensional manifold could locally be approximated with a linear subspace (Bregler and Omohundro, 1994; Landelius, 1997). Since we are dealing with learning systems, the local models should be adaptive. In this context, low-dimensional linear local models have several advantages. First of all, the number of parameters in a low-dimensional linear model is low, which reduces the number of samples needed for estimating the model compared to a global model. This is necessary since the locality constraint limits the number of available samples. Moreover, the locality reduces the spatial credit assignment problem (section 2.7.2, page 35) since the adaptation of one local model will in general not have any major effects on the other models (Baker and Farell, 1992). How the local linear models should be chosen, i.e. according to what criteria the models' adaptation should be optimized, depends of course on the task. A method for estimating local linear models for four different criteria is presented in chapter 4.

3.5 Adaptive model distribution

In the previous section it was argued that the signal distribution in a learning system with high-dimensional input should be modelled with local adaptive models. This raises the question of how to distribute these local models. The simplest way is, of course, to divide the signal space into a number of regions (e.g. N-dimensional boxes) and put an adaptive model in each region. Such an approach is, however, not very efficient since, as has been discussed above, most of the space will be empty and, hence, most models will never be used.
Moreover, with such an approach, parts of the signal that could be modelled using one single model would make use of several models due to the pre-defined subdivision. This would cause each of these models to be estimated using a smaller number of samples than would be the case if a single model was used, and hence this would cause an unnecessary uncertainty in the parameter estimation. Finally, the pre-defined subdivision cannot be guaranteed to be fine enough in areas where the signal has a complicated behaviour. An obvious solution to this problem is to make the model distribution adaptive. First of all, such an approach would only put models where the signal really exists. Furthermore, an adaptive model distribution makes it possible to distribute models sparsely where the signal has a smooth behaviour and more densely where it has not. An example of adaptive distribution of local linear models is given by Ritter et al. (1989, 1992), who use a SOFM (Kohonen, 1982) (see section 2.5.2, page 27) to distribute local linear models (Jacobian matrices) in a robot positioning task. Other methods are discussed by Landelius (1997), who suggests linear or quadratic models and Gaussian applicability functions organized in a tree structure (see also Landelius et al., 1996). The applicability functions define the regions where the local models are valid. In the system by Ritter et al., the applicability functions are defined by a winner-take-all rule for the units in the SOFM (page 27). Just as in the case of estimating the model parameters, the adaptive model distribution is task dependent. If, for example, the goal of the system is to achieve maximum reward, the models should be positioned where they are as useful as possible for getting that reward, and if the goal is maximum information transmission, the models should be positioned according to this goal. Hence, no general rule can be given for how to adaptively distribute local models.
One can only state that the goal must be to optimize the same criteria as the local models are trying to optimize together. This implies that the choice of models and the distribution of them are dependent on each other. The simpler a model is, i.e. the fewer parameters it has, the smaller the region will be where it is valid and, hence, the larger the number of models required. This does not mean, however, that a small number of more global complex models is as good as a large number of simpler and more local models, even if the total number of parameters is the same. As mentioned above (section 3.4), the locality in the latter approach reduces the spatial credit assignment problem and, hence, facilitates efficient learning.

3.6 Experiments

This chapter ends with two simple examples of reinforcement learning with different representations. The first one uses the channel representation described in section 3.1 and the prediction matrix memory from section 3.3.1 for learning the Q-function. The second example is a TD-method that uses local adaptive linear models both to represent the input-output function and to approximate the V-function. This algorithm was presented at ICANN'93 in Amsterdam (Borga, 1993). The experiment is made up of a system that plays "badminton" with itself. For simplicity, the problem is one-dimensional. The position of the shuttlecock is represented by a variable x. The system can change the value of x by adding the output value y to x. A small noise is also added to punish playing on the margin. The reinforcement signal to the system is zero except upon failure, when r = −1. Failure is the case when x does not change sign (i.e. the shuttlecock does not pass the net), or when |x| > 0.5 (i.e. the shuttlecock ends up outside the court).
3.6.1 Q-learning with the prediction matrix memory

The position x is represented by 25 cos²-channels in the interval −0.6 < x < 0.6 and the output y is represented by 45 cos²-channels in the interval −1.1 < y < 1.1. The channels have the shape defined in equation 3.2, illustrated in figure 3.1 on page 41. An offset value of one was added to the reinforcement signal, i.e. r = 1 except upon failure when r = 0, since the prediction matrix memory must contain positive values. The prediction matrix memory was trained to learn the Q-function as defined in equation 3.23 with the discount factor γ = 0.9. The matrix was updated according to the update rule in equations 3.20 and 3.24. α was set to a constant value of 0.05. The output channel vector q was generated according to equation 3.10. This vector was then decoded into a scalar. As mentioned in section 2.4.1, stochastic search methods are often used in reinforcement learning. Here, this is accomplished by adding Gaussian noise to the output. The variance σ was calculated as

  σ = max{0, 0.1 (10 − p)}    (3.25)

which gives a high noise level when the system predicts a low Q-value and a low noise level if the prediction is high. The value 10 is determined by the maximum value of the Q-function for γ = 0.9, since ∑_{i=0}^{∞} γ^i = 10. The max operation is to ensure that the variance does not become negative if the stochastic estimation occasionally gives predictions higher than 10. A typical run is illustrated to the left in figure 3.7. The graph shows the accumulated reward in a sliding window of 100 iterations. Note that the original reinforcement signal (i.e. −1 for failure) was used. To the right, the contents of the memory after convergence are illustrated. We see that the highest Q-value is predicted for the positions ±0.2 and the corresponding outputs ∓0.4 approximately, which is a reasonable solution.
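The exploration schedule of equation 3.25 can be sketched numerically. This is a minimal illustration, not the thesis code; the function name `noise_std` is an assumption for the sketch.

```python
# Sketch of the exploration-noise schedule of equation 3.25:
# sigma = max{0, 0.1 * (10 - p)}, where p is the predicted Q-value.
# Low predictions give high noise (more exploration), high predictions
# give low noise (more exploitation).

def noise_std(p):
    """Noise level as a function of the predicted Q-value p."""
    return max(0.0, 0.1 * (10.0 - p))

# The bound 10 comes from the geometric series sum_{i>=0} gamma^i = 1/(1 - gamma)
# with gamma = 0.9, which is the maximum possible Q-value here.
gamma = 0.9
q_max = 1.0 / (1.0 - gamma)
```

The `max` guard mirrors the text: if a prediction overshoots 10, the variance is clipped to zero rather than becoming negative.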
3.6.2 TD-learning with local linear models

In this experiment, both the predictions p of future accumulated reward and the actions y are linear functions of the input variable x. There is one pair of reinforcement association vectors v_i and one pair of action association vectors w_i, i = {1, 2}. For each model i, the predicted reinforcement is calculated as

  p_i = v_i1 x + v_i2    (3.26)

and the output is calculated as

  y_i = N(µ_iy, σ_y),    (3.27)

where

  µ_iy = w_i1 x + w_i2.    (3.28)

Figure 3.7: Left: A typical run of the prediction matrix memory. The graph shows the accumulated reward in a sliding window of 100 iterations. Right: The prediction matrix memory after convergence. Black is zero and white is the maximum value.

The system chooses the model c such that

  p_c = max_i {m_i},    (3.29)

where

  m_i = N(p_i, σ_p),    (3.30)

and generates the corresponding action y_c. The internal reinforcement signal at time t+1 is calculated as

  r̂[t+1] = r[t+1] + γ p_max[t, t+1] − p_c[t, t].    (3.31)

This is in principle the same TD-method as the one used by Sutton (1984), except that here there are two predictions at each time, one for each model. p_max[t, t+1] is the maximum predicted reinforcement calculated using the reinforcement association vector from time t and the input from time t+1. If the system fails, i.e. r = −1, then p_max[t, t+1] is set to zero. p_c[t, t] is the prediction of the selected model. Learning is accomplished by changing the weights in the reinforcement association vectors and the action association vectors. Only the vectors associated with the chosen model are altered.
The association vectors are updated according to the following rule:

  w_c[t+1] = w_c[t] + α r̂ (y_c − µ_cy) x    (3.32)

and

  v_c[t+1] = v_c[t] + β r̂ x,    (3.33)

where c denotes the model choice, α and β are positive learning rate constants and

  x = (x, 1)^T.

In this experiment, noise is added to the output on two levels: first in the selection of model and then in the generation of the output signal. The noise levels are controlled by σ_p and σ_y respectively, as shown in equations 3.27 and 3.30. The variance parameters are calculated as

  σ_p = max{0, −0.1 max_i{p_i}}    (3.34)

and

  σ_y = max{0, −0.1 p_c}.    (3.35)

The first "max" in the two equations is to make sure that the variances do not become negative. The negative signs are there because the (relevant) predictions are negative. In this way, the higher the prediction of reinforcement, the more precision there will be in the output. The learning behaviour is illustrated to the left in figure 3.8. To the right, the total input-output function is plotted. For each input value, the model with the highest predicted reward has been used. The discrete step close to zero marks the point in the input space where the system switches between the two models. The optimal position for this point is of course zero. One problem that can occur with this algorithm, and other similar algorithms, is when both models prefer the same part of the input space. This means that the two reinforcement prediction functions predict the same reinforcement for the same inputs and, as a result, both models generate the same actions. This problem can of course be solved if the teacher who generates the external reinforcement signal knows approximately where the breakpoint should be and which model should act on which side. The teacher could then punish the system for selecting the wrong model by giving negative reinforcement. In general, however, the teacher does not know how to divide the problem.
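One step of the scheme in equations 3.26 to 3.35 can be sketched as follows. This is a hedged illustration, not the thesis implementation: the toy input value, the random initialization, and the constants are assumptions, and the selection noise on the predictions (equation 3.30) is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two local linear models (equations 3.26-3.28); initial weights are arbitrary.
N = 2
v = 0.1 * rng.normal(size=(N, 2))   # reinforcement association vectors (v_i1, v_i2)
w = 0.1 * rng.normal(size=(N, 2))   # action association vectors (w_i1, w_i2)
alpha, beta = 0.1, 0.1              # learning rates (assumed values)
gamma, sigma_y = 0.9, 0.1

x = 0.3
xvec = np.array([x, 1.0])           # x = (x, 1)^T as in the update rules

p = v @ xvec                        # predictions p_i = v_i1 x + v_i2  (eq 3.26)
c = int(np.argmax(p))               # model choice (noise of eq 3.30 omitted)
mu_y = w[c] @ xvec                  # mean action mu_cy = w_c1 x + w_c2  (eq 3.28)
y = mu_y + sigma_y * rng.normal()   # action y_c ~ N(mu_cy, sigma_y)  (eq 3.27)

# Internal reinforcement (eq 3.31); r and the next-step prediction are assumed.
r, p_max_next = 0.0, float(p.max())
r_hat = r + gamma * p_max_next - p[c]

# Update only the chosen model (eqs 3.32 and 3.33).
w[c] = w[c] + alpha * r_hat * (y - mu_y) * xvec
v[c] = v[c] + beta * r_hat * xvec
```

Note that, as in the text, only the vectors of the selected model c are altered at each step.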
In that case, the teacher must try to use a pedagogical reward as discussed in section 2.4.2 on page 20. The teacher could for instance give less reward if the models try to cover the same part of the input space and a higher reward when the models tend to cover different parts of the space.

Figure 3.8: Left: A typical run of the TD-learning system with two local models. The graph shows the accumulated reward in a sliding window of 100 iterations. Right: The total input-output function after convergence.

3.6.3 Discussion

If we compare the contents of the prediction matrix memory to the right in figure 3.7 and the combined function of the linear models in the TD-system plotted to the right in figure 3.8, we see that the two systems implement approximately the same function. If we compare the learning behaviour (plotted to the left in figures 3.7 and 3.8), the prediction matrix memory appears to learn faster than the TD-method. It should however be noted that each iteration of the prediction matrix memory has a computational complexity of order O(QV), where Q and V are the numbers of channels used for representing the input and output signals respectively. In this experiment, we used Q = 25 and V = 45. A larger number of channels enhances the performance when the system has converged, but increases the required number of iterations until convergence as well as the computational complexity of each iteration. The computational complexity of the second method is of order O(N(X+1)Y) per iteration, where N is the number of local models (in this case 2), X is the dimensionality of the input signal (in this case 1) and Y is the dimensionality of the output signal (in this case 1). The algorithms have not been optimized with respect to convergence time.
The convergence speed depends on the settings of the learning rate constants α and β and the modulation of the variance parameters σ_p and σ_y. These parameters have only been tuned to constant values that work reasonably well. Better results can be expected if the learning rates are made adaptive, as discussed in section 2.3.2 on page 11.

Chapter 4

Low-dimensional linear models

As we have seen in the previous chapter (in section 3.4), local low-dimensional linear models are a good way of representing high-dimensional data in a learning system. The linear models can be seen as basis vectors spanning a (local) subspace of the signal space. The signal can then be (approximately) described in this new basis in terms of projections onto the new basis vectors. For signals with high dimensionality, an iterative algorithm for finding this basis must not exhibit a memory requirement or a computational cost significantly exceeding O(d) per iteration, where d is the dimensionality of the signal. Techniques involving matrix multiplications (having memory requirements of order O(d²) and computational costs of order O(d³)) quickly become infeasible as the signal space dimensionality increases. The purpose of local models is dimensionality reduction, which means throwing away information that is not needed. Hence, the criterion for an appropriate local model depends on the application. One criterion is to preserve as much variance as possible given a certain dimensionality of the model. This is done by projecting the data onto the subspace of maximum data variation, i.e. the subspace spanned by the largest principal components. This is known as principal component analysis (PCA). There are a number of applications in signal processing where principal components play an important role, for example image coding. In applications where relations between two sets of data (e.g.
process input and output) are considered, PCA or other self-organizing algorithms for representing the two sets of data separately are not very useful, since such methods cannot separate useful information from noise. Consider, for example, two high-dimensional signals that are described by their most significant principal components. There is no reason to believe that these descriptions of the signals are related in any way. In other words, the signal in the direction of maximum variance in one space may be totally independent of the signal in the direction of maximum variance in another space, even if there is a strong relation between the signals. The reason for this is that there is no way of finding the relation between two sets of data just by looking at one of the sets. Instead, the two signal spaces must be considered together. One method for doing this is finding the subspaces in the input and the output spaces for which the data covariation is maximized. These subspaces turn out to be the ones corresponding to the largest singular values of the between-sets covariance matrix (Landelius et al., 1995). A singular value decomposition (SVD) of the between-sets covariance matrix corresponds to partial least squares (PLS) (Wold et al., 1984; Höskuldsson, 1988). In general, however, the input to a system comes from a set of different sensors, and it is evident that the range (or variance) of the signal values from a given sensor is unrelated to the importance of the received information. The same line of reasoning holds for the output, which may consist of signals to a set of different effectuators. In these cases, the covariances between signals are not relevant. There may, for example, be one pair of directions in the two spaces that has a high covariance due to high signal magnitude but has a high noise level, while another pair of directions has an almost perfect correlation but a small signal magnitude and therefore low covariance.
Here, correlation between input and output signals is a more appropriate target for analysis, since this measure of signal relations is invariant to the signal magnitudes. This approach leads to a canonical correlation analysis (CCA) (Hotelling, 1936) of the two sets of signals. Finally, when the goal is to predict a signal as well as possible in a least square error sense, the basis must be chosen so that this error measure is minimized. This corresponds to a low-rank approximation of multivariate linear regression (MLR). This is also known as reduced rank regression (Izenman, 1975) or as redundancy analysis (van den Wollenberg, 1977). In general, these four different criteria for selecting basis vectors lead to four different solutions. But, as we will see, the problems are related to each other and can be formulated in very similar ways. An important problem which is directly related to the situations discussed above is the generalized eigenproblem or two-matrix eigenproblem (Bock, 1975; Golub and van Loan, 1989; Stewart, 1976). In the next section, the generalized eigenproblem is described in some detail and its relation to an energy function called the Rayleigh quotient is shown. It is shown that the four important methods discussed above (principal component analysis (PCA), partial least squares (PLS), canonical correlation analysis (CCA) and multivariate linear regression (MLR)) emerge as solutions to special cases of the generalized eigenproblem. In section 4.7, an iterative O(d) algorithm that solves the generalized eigenproblem by a gradient search on the Rayleigh quotient is presented. The solutions are found in a successive order, beginning with the largest eigenvalue and the corresponding eigenvector. It is shown how to apply this algorithm in order to obtain the required solutions in the special cases of PCA, PLS, CCA and MLR.
Throughout this chapter, the variables are assumed to be real valued and have zero mean, so that the covariance matrices can be defined as C_xx = E[xx^T]. The zero mean does not impose any limitations on the methods discussed, since the mean values can easily be estimated and stored by each local model. The essence of this chapter has been submitted for publication (Borga et al., 1997b).

4.1 The generalized eigenproblem

When dealing with many scientific and engineering problems, some version of the generalized eigenproblem sometimes needs to be solved along the way:

  A ê = λ B ê   or   B⁻¹ A ê = λ ê.    (4.1)

(In the right-hand equation, B is supposed to be non-singular.) In mechanics, the eigenvalues often correspond to modes of vibration. Here, however, the case where the matrices A and B consist of components which are expectation values from stochastic processes is considered. Furthermore, both matrices are symmetric and, in addition, B is positive definite. The generalized eigenproblem is closely related to the problem of finding the extremum points (i.e. the points of zero derivative) of a ratio of quadratic forms:

  r = (w^T A w) / (w^T B w),    (4.2)

where both A and B are symmetric and B is positive definite. This ratio is known as the Rayleigh quotient, and its critical points correspond to the eigensystem of the generalized eigenproblem. To see this, consider the gradient of r:

  ∂r/∂w = 2/(w^T B w) (A w − r B w) = α (A ŵ − r B ŵ),    (4.3)

where α = α(w) is a positive scalar. Setting the gradient to 0 gives

  A ŵ = r B ŵ   or   B⁻¹ A ŵ = r ŵ,    (4.4)

which is recognized as the generalized eigenproblem (equation 4.1). The solutions r_i and ŵ_i are the eigenvalues and eigenvectors respectively of the matrix B⁻¹A. This means that the extremum points of the Rayleigh quotient r(w) are solutions to the corresponding generalized eigenproblem. The eigenvalues are the extremum values of the quotient and the eigenvectors are the corresponding parameter vectors w of the quotient.
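The correspondence between stationary points of the Rayleigh quotient and eigenpairs of B⁻¹A can be checked numerically. The following is a sketch (not from the thesis) using the small example matrices of equation 4.5; the symmetric A and positive definite B are taken from there.

```python
import numpy as np

# Example matrices of equation 4.5: A symmetric, B positive definite.
A = np.array([[1.0, 0.0],
              [0.0, 0.25]])
B = np.array([[2.0, 1.0],
              [1.0, 1.0]])

# Eigenpairs of B^{-1} A; for this pair all eigenvalues are real.
evals, evecs = np.linalg.eig(np.linalg.inv(B) @ A)
evals, evecs = evals.real, evecs.real

def rayleigh(w):
    """Rayleigh quotient r(w) = w'Aw / w'Bw (equation 4.2)."""
    return (w @ A @ w) / (w @ B @ w)

# At each eigenvector the quotient attains the corresponding eigenvalue,
# i.e. the stationary points of r solve the generalized eigenproblem (eq 4.4).
quotients = np.array([rayleigh(evecs[:, i]) for i in range(2)])
```

Since the quotient is invariant to the norm of w, any rescaling of the eigenvectors gives the same quotient values.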
A special case of the Rayleigh quotient is Fisher's linear discriminant function (Fisher, 1936) used in classification. In this case, A is the between-class scatter matrix and B is the within-class scatter matrix (see for example Duda and Hart, 1973).

Figure 4.1: Left: The Rayleigh quotient r(w) between two matrices A and B. The curve is plotted as rŵ. The eigenvectors of B⁻¹A are marked as reference. The corresponding eigenvalues are marked as the radii of the two circles. Note that the quotient is invariant to the norm of w. Right: The gradient of r. The arrows indicate the direction of the gradient and the radii of the blobs correspond to the magnitude of the gradient.

As an illustration, the Rayleigh quotient is plotted to the left in figure 4.1 for two matrices A and B:

  A = [1, 0; 0, 0.25]   and   B = [2, 1; 1, 1].    (4.5)

The quotient is plotted as the radius in different directions ŵ. Note that the quotient is invariant to the norm of w. The two eigenvalues are shown as circles with their radii corresponding to the eigenvalues. The figure shows that the eigenvectors e₁ and e₂ of the generalized eigenproblem coincide with the maximum and minimum values of the Rayleigh quotient. To the right in the same figure, the gradient of the Rayleigh quotient is illustrated as a function of the direction of w. Note that the gradient is orthogonal to w (see equation 4.3). This means that a small change of w in the direction of the gradient can be seen as a rotation of w. The arrows indicate the direction of this rotation and the radii of the blobs correspond to the magnitude of the gradient. The figure shows that the directions of zero gradient coincide with the eigenvectors and that the gradient points towards the eigenvector corresponding to the largest eigenvalue. If the eigenvalues r_i are distinct¹ (i.e.
r_i ≠ r_j for i ≠ j), the different eigenvectors are orthogonal in the metrics A and B, i.e.

  ŵ_i^T B ŵ_j = { 0 for i ≠ j;  β_i > 0 for i = j }   and   ŵ_i^T A ŵ_j = { 0 for i ≠ j;  r_i β_i for i = j }    (4.6)

(see proof B.3.1 on page 157). This means that the w_i are linearly independent (see proof B.3.2 on page 158). Since an n-dimensional space gives n eigenvectors which are linearly independent, {w₁, …, w_n} constitutes a basis and any w can be expressed as a linear combination of the eigenvectors. Now, it can be proved (see proof B.3.3 on page 158) that the function r is bounded by the largest and the smallest eigenvalue, i.e.

  r_n ≤ r ≤ r₁,    (4.7)

which means that there exists a global maximum and that this maximum is r₁. To investigate if there are any other local maxima, we look at the second derivative, or the Hessian H, of r at the solutions to the eigenproblem,

  H_i = ∂²r/∂w² |_{w=ŵ_i} = 2/(ŵ_i^T B ŵ_i) (A − r_i B)    (4.8)

(see proof B.3.4 on page 159). The Hessians H_i have positive eigenvalues for i > 1, i.e. there exist vectors w such that

  w^T H_i w > 0   ∀ i > 1    (4.9)

(see proof B.3.5 on page 159). This means that for all solutions to the eigenproblem except for the largest root, there exists a direction in which r increases. In other words, all extremum points of the function r are saddle points except for the global minimum and maximum points. Since the two-dimensional example in figure 4.1 only has two eigenvalues, they correspond to the maximum and minimum values of r. In the following sections, it is shown that finding the directions of maximum variance, maximum covariance, maximum correlation and minimum square error can be seen as special cases of the generalized eigenproblem.

¹ The eigenvalues will be distinct in all practical applications since all real signals contain noise.
4.2 Principal component analysis

Consider a set of random vectors x (signals) with a covariance matrix defined by

  C_xx = E[xx^T].    (4.10)

Suppose the goal is to find the direction of maximum variation in the signal distribution. The direction of maximum variation means the direction ŵ such that the linear combination x = x^T ŵ possesses maximum variance. Hence, finding this direction is equivalent to finding the maximum of

  ρ = E[xx] = E[ŵ^T x x^T ŵ] = ŵ^T E[xx^T] ŵ = (w^T C_xx w) / (w^T w).    (4.11)

This is a special case of the Rayleigh quotient in equation 4.2 on page 61 with

  A = C_xx   and   B = I.    (4.12)

Since the covariance matrix is symmetric, it is possible to decompose it into its eigenvalues and orthogonal eigenvectors as

  C_xx = E[xx^T] = ∑_i λ_i ê_i ê_i^T,    (4.13)

where λ_i and ê_i are the eigenvalues and the orthogonal eigenvectors respectively. Hence, the problem of maximizing the variance ρ can be seen as the problem of finding the largest eigenvalue, λ₁, and its corresponding eigenvector, since

  λ₁ = ê₁^T C_xx ê₁ = max (w^T C_xx w) / (w^T w) = max ρ.    (4.14)

It is also worth noting that it is possible to find the direction and magnitude of maximum data variation for the inverse of the covariance matrix. In this case, we simply identify the matrices in equation 4.2 on page 61 as A = I and B = C_xx. The eigenvectors ê_i are also known as the principal components of the distribution of x. Principal component analysis (PCA) is an old tool in multivariate data analysis; it was used already in 1901 (Pearson, 1901). The projection of data onto the principal components is sometimes called the Hotelling transform after Hotelling (1933) or the Karhunen-Loève transform (KLT) after Karhunen (1947) and Loève (1963). This transformation is an orthogonal transformation that diagonalizes the covariance matrix. PCA gives a data dependent set of basis vectors that is optimal in a statistical mean square error sense.
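The special case A = C_xx, B = I of equations 4.10 to 4.14 can be sketched as follows. This is an illustrative sketch, not thesis code; the anisotropic toy data set is an assumption.

```python
import numpy as np

# Sketch of PCA as the Rayleigh quotient with A = C_xx and B = I:
# the direction of maximum variance is the largest eigenvector of C_xx.
rng = np.random.default_rng(0)

# Toy data (assumption): 3-D samples with very different variances per axis.
X = rng.normal(size=(2000, 3)) * np.array([3.0, 1.0, 0.3])
X -= X.mean(axis=0)                  # zero mean, as assumed in the chapter
Cxx = X.T @ X / len(X)               # sample covariance matrix (eq 4.10)

lam, E = np.linalg.eigh(Cxx)         # eigh: ascending eigenvalues, orthonormal vectors
w1 = E[:, -1]                        # first principal component (eq 4.14)
rho = float(w1 @ Cxx @ w1)           # variance along w1 equals lambda_1
```

Projecting the data onto `w1` preserves the maximum possible variance for a one-dimensional model, which is the PCA criterion discussed in the text.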
This was shown in equation 2.45 on page 33 for one basis vector, and the result can easily be generalized to a set of basis vectors by the following reasoning: Given one basis vector, the best we can do is to choose the maximal eigenvector of the covariance matrix. This basis vector describes the signal completely in that direction. Hence, there is nothing more in that direction to describe, and the next basis vector should be chosen orthogonal to the first. Now the same problem is faced again, but in a smaller space where the first principal component of the signal is removed. So the best choice of the second basis vector is a unit vector in the direction of the first principal component in this subspace, and that direction corresponds to the second eigenvector² of the covariance matrix. This process can be repeated for all basis vectors. The KLT can be used for image coding (Torres and Kunt, 1996) since it is the optimal transform coding in a mean square error sense. This is, however, not very common. One reason for this is that the KLT is computationally more expensive than the discrete cosine transform (DCT). Another reason is the need for transmission of the data dependent basis vectors. Besides that, the mean square error is in general not a very good error measure for images, since two images with a large mean square distance can look very similar to a human observer. Another use for PCA in multivariate statistical analysis is to find linear combinations of variables where the variance is high. Here, it should be noted that PCA is dependent on the units used for measuring. If the unit of one variable is changed, for example from metres to feet, the orientations of the principal components may change. For further details on PCA, see for example the overview by Jolliffe (1986).
When dealing with learning systems, it could be tempting to use PCA to find local linear models to reduce the dimensionality of a high-dimensional input (and output) space. The problem with this approach is that the best representation of the input signal is in general not the least mean square error representation of that signal. There may be components in the input signal that have high variances but are totally irrelevant when it comes to generating responses, and there may be components with small variances that are very important. In other words, PCA is not a good tool for analysing the relations between two sets of variables. The need for simultaneous analysis of the input and output signals in learning systems was indicated in the quotation from Brooks (1986) on page 8 and also in the wheel-chair experiment (Held and Bossom, 1961; Mikaelian and Held, 1964) mentioned on the same page.

² The somewhat informal notation "second eigenvector" refers to the eigenvector corresponding to the second largest eigenvalue.

4.3 Partial least squares

Now, consider two sets of random vectors x and y with the between-sets covariance matrix defined by

  C_xy = E[xy^T].    (4.15)

Suppose, this time, that the goal is to find the two directions of maximal data covariation, by which is meant the directions ŵ_x and ŵ_y such that the linear combinations x = x^T ŵ_x and y = y^T ŵ_y give maximum covariance. This means that the following function should be maximized:

  ρ = E[xy] = E[ŵ_x^T x y^T ŵ_y] = ŵ_x^T E[xy^T] ŵ_y = (w_x^T C_xy w_y) / √(w_x^T w_x · w_y^T w_y).    (4.16)

Note that, for each ρ, a corresponding value −ρ is obtained by rotating w_x or w_y 180°. For this reason, the maximum magnitude of ρ is obtained by finding the largest positive value. This function cannot be written as a Rayleigh quotient. However, the critical points of this function coincide with the critical points of a Rayleigh quotient with proper choices of A and B.
To see this, we calculate the derivatives of this function with respect to the vectors w_x and w_y (see proof B.3.6 on page 160):

  ∂ρ/∂w_x = 1/‖w_x‖ (C_xy ŵ_y − ρ ŵ_x)
  ∂ρ/∂w_y = 1/‖w_y‖ (C_yx ŵ_x − ρ ŵ_y).    (4.17)

Setting these expressions to zero and solving for w_x and w_y results in

  C_xy C_yx ŵ_x = ρ² ŵ_x
  C_yx C_xy ŵ_y = ρ² ŵ_y.    (4.18)

This is exactly the same result as that given by the extremum points of r in equation 4.2 on page 61 if the matrices A and B and the vector w are chosen according to:

  A = [0, C_xy; C_yx, 0],   B = [I, 0; 0, I]   and   w = [µ_x ŵ_x; µ_y ŵ_y].    (4.19)

This is easily verified by insertion of the expressions above into equation 4.4, which results in

  C_xy ŵ_y = r (µ_x/µ_y) ŵ_x
  C_yx ŵ_x = r (µ_y/µ_x) ŵ_y.    (4.20)

Solving for w_x and w_y gives equation 4.18 with r² = ρ². Hence, the problem of finding the direction and magnitude of the largest data covariation can be seen as maximizing a special case of the Rayleigh quotient (equation 4.2 on page 61) with the appropriate choice of matrices. The between-sets covariance matrix can be expanded by means of a singular value decomposition (SVD), where the two sets of vectors {ê_xi} and {ê_yi} are mutually orthogonal:

  C_xy = ∑_i λ_i ê_xi ê_yi^T,    (4.21)

where the positive numbers λ_i are referred to as the singular values. Since the basis vectors are orthogonal, the problem of maximizing the quotient in equation 4.16 is equivalent to finding the largest singular value:

  λ₁ = ê_x1^T C_xy ê_y1 = max (w_x^T C_xy w_y) / √(w_x^T w_x · w_y^T w_y) = max ρ.    (4.22)

The SVD of a between-sets covariance matrix is directly related to the method of partial least squares (PLS). PLS was developed in econometrics in the 1960s by Herman Wold. It is most commonly used for regression in the field of chemometrics (Wold et al., 1984). For an overview, see for example Geladi and Kowalski (1986) and Höskuldsson (1988). In PLS regression, the principal vectors corresponding to the largest principal values are used as a new, lower dimensional, basis for the signal.
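The SVD view of equations 4.21 and 4.22 can be sketched numerically. This is an illustrative sketch under assumed toy data (a shared latent signal carried in one coordinate of each set), not thesis code.

```python
import numpy as np

# Sketch: the directions of maximum covariance are the singular vectors of the
# between-sets covariance matrix C_xy; the largest singular value is the
# maximal covariance (equations 4.16-4.22).
rng = np.random.default_rng(2)
n = 2000
s = rng.normal(size=(n, 1))                       # shared latent signal (assumption)
x = np.hstack([2.0 * s, rng.normal(size=(n, 1))]) # x carries 2s in coordinate 0
y = np.hstack([rng.normal(size=(n, 1)), s])       # y carries s in coordinate 1
x -= x.mean(axis=0)
y -= y.mean(axis=0)

Cxy = x.T @ y / n                                 # sample between-sets covariance
U, svals, Vt = np.linalg.svd(Cxy)
wx, wy = U[:, 0], Vt[0]                           # directions of maximum covariance
```

By the SVD identity, `wx @ Cxy @ wy` equals the largest singular value, i.e. the maximal covariance between the two linear combinations.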
A regression of y onto x is then performed in this new basis. As in the case of PCA, the scaling of the variables affects the solutions of the PLS. The reason for this is the maximum covariance criterion; the covariance between two variables is proportional to the variances of the variables. Therefore, a scaling of the x variables to unit variance is sometimes suggested (Wold et al., 1984). Such a solution can of course also amplify the noise, which can cause problems in the parameter estimation³.

³ An example of such a problem has been reported from the paper industry (personal communication). In that case, the normalized data had to be filtered to reduce the amplified noise! The filtering will likely introduce new artifacts.

4.4 Canonical correlation analysis

Again, consider two random variables x and y with zero mean and stemming from a multi-normal distribution with the total covariance matrix

  C = [C_xx, C_xy; C_yx, C_yy] = E[ [x; y] [x; y]^T ].    (4.23)

Now, suppose that the goal is to find the directions of maximum data correlation. Consider the linear combinations x = x^T ŵ_x and y = y^T ŵ_y of the two variables respectively. This means that the function to be maximized is

  ρ = E[xy] / √(E[x²] E[y²])
    = E[ŵ_x^T x y^T ŵ_y] / √(E[ŵ_x^T x x^T ŵ_x] E[ŵ_y^T y y^T ŵ_y])
    = (w_x^T C_xy w_y) / √(w_x^T C_xx w_x · w_y^T C_yy w_y).    (4.24)

Also in this case, since ρ changes sign if w_x or w_y is rotated 180°, it is sufficient to find the positive values. Like equation 4.16, this function cannot be written as a Rayleigh quotient. But also in this case, it can be shown that the critical points of this function coincide with the critical points of a Rayleigh quotient with proper choices of A and B. The partial derivatives of ρ with respect to w_x and w_y are (see proof B.3.7 on page 160)

  ∂ρ/∂w_x = a/‖w_x‖ ( C_xy ŵ_y − (ŵ_x^T C_xy ŵ_y)/(ŵ_x^T C_xx ŵ_x) C_xx ŵ_x )
  ∂ρ/∂w_y = a/‖w_y‖ ( C_yx ŵ_x − (ŵ_y^T C_yx ŵ_x)/(ŵ_y^T C_yy ŵ_y) C_yy ŵ_y ),    (4.25)

where a is a positive scalar.
Setting the derivatives to zero gives the equation system

  C_xy ŵ_y = ρ λ_x C_xx ŵ_x
  C_yx ŵ_x = ρ λ_y C_yy ŵ_y,    (4.26)

where

  λ_x = λ_y⁻¹ = √( (ŵ_y^T C_yy ŵ_y) / (ŵ_x^T C_xx ŵ_x) ).    (4.27)

λ_x is the ratio between the standard deviation of y and the standard deviation of x, and vice versa. The λs can be interpreted as scaling factors between the linear combinations. Rewriting equation system 4.26 gives

  C_xx⁻¹ C_xy C_yy⁻¹ C_yx ŵ_x = ρ² ŵ_x
  C_yy⁻¹ C_yx C_xx⁻¹ C_xy ŵ_y = ρ² ŵ_y.    (4.28)

Hence, ŵ_x and ŵ_y are found as the eigenvectors of the matrices C_xx⁻¹ C_xy C_yy⁻¹ C_yx and C_yy⁻¹ C_yx C_xx⁻¹ C_xy respectively. The corresponding eigenvalues ρ² are the squared canonical correlations. The eigenvectors corresponding to the largest eigenvalue ρ₁² are the vectors ŵ_x1 and ŵ_y1 that maximize the correlation between the canonical variates x₁ = x^T ŵ_x1 and y₁ = y^T ŵ_y1. Now, if

  A = [0, C_xy; C_yx, 0],   B = [C_xx, 0; 0, C_yy]   and   w = [w_x; w_y] = [µ_x ŵ_x; µ_y ŵ_y],    (4.29)

equation 4.4 can be written as

  C_xy ŵ_y = r (µ_x/µ_y) C_xx ŵ_x
  C_yx ŵ_x = r (µ_y/µ_x) C_yy ŵ_y,    (4.30)

which is recognized as equation 4.26 for ρ λ_x = r µ_x/µ_y and ρ λ_y = r µ_y/µ_x. Solving for w_x and w_y in equation 4.30 gives equation 4.28 with r² = ρ². This shows that the equations for the canonical correlations are obtained as the result of maximizing the Rayleigh quotient (equation 4.2 on page 61). Canonical correlation analysis was developed by Hotelling (1936). Some of the results presented here can also be found in (Borga, 1995; Knutsson et al., 1995; Borga et al., 1997a). Although it is a standard tool in statistical analysis (see for example Anderson, 1984), where canonical correlation has been used for example in economics, medical studies, meteorology and even in the classification of malt whisky (Lapointe and Legendre, 1994) and wine (Montanarella et al., 1995), it is surprisingly unknown in the fields of learning and signal processing. Some exceptions are Becker (1996), Kay (1992), Fieguth et al. (1995), Das and Sen (1994) and Li et al. (1997).
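The eigenproblem of equation 4.28 can be sketched numerically, and the toy data below also illustrate the point made about PLS: the correlated pair is found even though it has a very small magnitude in x and a very large one in y. This is a sketch under assumed toy data, not thesis code.

```python
import numpy as np

# Sketch of equation 4.28: the canonical directions are eigenvectors of
# Cxx^{-1} Cxy Cyy^{-1} Cyx, and the eigenvalues are the squared canonical
# correlations rho^2.
rng = np.random.default_rng(3)
n = 5000
s = rng.normal(size=(n, 1))                                   # shared latent signal

# Assumed toy data: the correlated components have very different magnitudes
# (0.1 vs 5.0), which would dominate or vanish under a covariance criterion.
x = np.hstack([0.1 * s + 0.01 * rng.normal(size=(n, 1)),
               rng.normal(size=(n, 1))])
y = np.hstack([rng.normal(size=(n, 1)),
               5.0 * s + 0.05 * rng.normal(size=(n, 1))])
x -= x.mean(axis=0)
y -= y.mean(axis=0)

Cxx = x.T @ x / n
Cyy = y.T @ y / n
Cxy = x.T @ y / n

M = np.linalg.inv(Cxx) @ Cxy @ np.linalg.inv(Cyy) @ Cxy.T     # eq 4.28, left system
rho2, Wx = np.linalg.eig(M)
rho = float(np.sqrt(np.max(rho2.real)))                        # largest canonical correlation
```

Despite the 50-fold magnitude difference between the correlated components, the leading canonical correlation is close to one, reflecting the invariance to signal magnitudes discussed in the text.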
An important property of canonical correlations is that they are invariant with respect to affine transformations of x and y. An affine transformation is given by a translation of the origin followed by a linear transformation. A translation of the origin of x or y has no effect on ρ since it leaves the covariance matrix C unaffected. Invariance with respect to scalings of x and y follows directly from equation 4.24. For invariance with respect to other linear transformations, see proof B.3.8 on page 161. Hence, in contrast to PLS, there is no need for normalization of the variables in CCA.

Another important property is that the canonical variates of different solutions are uncorrelated, i.e.

\[
\begin{cases}
E[x_i x_j] = w_{xi}^T C_{xx} w_{xj} = 0 \\
E[y_i y_j] = w_{yi}^T C_{yy} w_{yj} = 0 \\
E[x_i y_j] = w_{xi}^T C_{xy} w_{yj} = 0
\end{cases}
\quad\text{for } i \neq j,
\tag{4.31}
\]

according to equation 4.6.

4.4.1 Relation to mutual information and ICA

As mentioned in section 2.5.3, there is a relation between correlation and mutual information (equation 2.44). Since information is additive for statistically independent variables (equation 2.33) and the canonical variates are uncorrelated, the mutual information between x and y is the sum of the mutual information between the variates x_i and y_i if there are no higher-order statistical dependencies than correlation (second-order statistics). For Gaussian variables this means

\[
I(x; y) = \frac{1}{2}\log\frac{1}{\prod_i (1 - \rho_i^2)}
        = \frac{1}{2}\sum_i \log\frac{1}{1 - \rho_i^2},
\tag{4.32}
\]

using equation 2.44 on page 32. This is also more formally shown in proof B.3.9 on page 162. Kay (1992)⁴ has shown that this relation, up to an additive constant, holds for all elliptically symmetric distributions of the form

\[
c\, f\!\left( (z - \bar{z})^T C^{-1} (z - \bar{z}) \right).
\tag{4.33}
\]

Minimizing mutual information between signal components is known as independent component analysis (ICA) (see for example Comon, 1994). If there are no higher-order statistical dependencies than correlation (e.g.
if the variables are jointly Gaussian⁵), the canonical variates x_i, x_j, i ≠ j, are independent components since they are uncorrelated.

⁴There is a difference of a factor 0.5 between equation 4.32 and Kay's equation, due to a typographical error.
⁵The definition of ICA requires that at most one of the source components is Gaussian (Comon, 1994).

4.4.2 Relation to SNR

The correlation is strongly related to the signal-to-noise ratio (SNR), which is a more commonly used measure in signal processing. This relation is used later in this thesis. Consider a signal x and two noise signals η₁ and η₂, all having zero mean⁶ and all being uncorrelated with each other. Let S = E[x²] and N_i = E[η_i²] be the energies of the signal and the noise signals respectively. Then the correlation between a(x + η₁) and b(x + η₂) is

\[
\rho = \frac{E[ab(x + \eta_1)(x + \eta_2)]}{\sqrt{E[a^2(x + \eta_1)^2]\;E[b^2(x + \eta_2)^2]}}
     = \frac{E[x^2]}{\sqrt{\left(E[x^2] + E[\eta_1^2]\right)\left(E[x^2] + E[\eta_2^2]\right)}}
     = \frac{S}{\sqrt{(S + N_1)(S + N_2)}}.
\tag{4.34}
\]

Note that the amplification factors a and b do not affect the correlation or the SNR.

⁶The assumption of zero mean is for convenience. A non-zero mean does not affect the SNR or the correlation.

Equal noise energies

In the special case where the noise energies are equal, i.e. N₁ = N₂ = N, equation 4.34 can be written as

\[
\rho = \frac{S}{S + N}.
\tag{4.35}
\]

This means that the SNR can be written as

\[
\frac{S}{N} = \frac{\rho}{1 - \rho}.
\tag{4.36}
\]

Here, it should be noted that the noise affects the signal twice, so this relation between SNR and correlation is perhaps not so intuitive. This relation is illustrated in figure 4.2 (top).

Correlation between a signal and the corrupted signal

Another special case is when N₁ = 0 and N₂ = N. Then, the correlation between a signal and a noise-corrupted version of that signal is

\[
\rho = \frac{S}{\sqrt{S(S + N)}}.
\tag{4.37}
\]

In this case, the relation between SNR and correlation is

\[
\frac{S}{N} = \frac{\rho^2}{1 - \rho^2}.
\tag{4.38}
\]

This relation between correlation and SNR is illustrated in figure 4.2 (bottom).
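Equations 4.34–4.36 are easy to check numerically. The following sketch (invented data; not from the thesis) measures the correlation between two noise-corrupted copies of a signal and compares it with the predicted S/(S + N):

```python
import numpy as np

# Numerical check of equations 4.34-4.36: a unit-variance signal corrupted
# by two independent noise signals of equal energy N = 0.25.
rng = np.random.default_rng(1)
n = 200_000
x = rng.standard_normal(n)             # signal, S = 1
eta1 = 0.5 * rng.standard_normal(n)    # noise, N = 0.25
eta2 = 0.5 * rng.standard_normal(n)

S, N = 1.0, 0.25
rho_measured = np.corrcoef(x + eta1, x + eta2)[0, 1]
rho_predicted = S / (S + N)                        # equation 4.35
snr_from_rho = rho_predicted / (1 - rho_predicted)  # equation 4.36
```

With S/N = 4 the predicted correlation is 0.8, and the sample estimate agrees to well within sampling error.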
Figure 4.2: Top: The relation between correlation and SNR for two signals, each corrupted by uncorrelated noise of equal energy. Bottom: The relation between correlation and SNR when the correlation is measured between a signal and a noise-corrupted version of that signal.

4.5 Multivariate linear regression

Again, consider two random variables x and y with zero mean and stemming from a multi-normal distribution with covariance as in equation 4.23. In this case, the goal is to minimize the square error

\[
\varepsilon^2 = E\!\left[ \| y - \beta\, \hat{w}_x^T x\, \hat{w}_y \|^2 \right]
 = E[y^T y] - 2\beta\, \hat{w}_x^T C_{xy} \hat{w}_y + \beta^2\, \hat{w}_x^T C_{xx} \hat{w}_x,
\tag{4.39}
\]

i.e. a rank-one approximation of the MLR of y onto x based on minimum square error. The problem is to find not only the regression coefficient β, but also the optimal basis ŵ_x and ŵ_y. To get an expression for β, we calculate the derivative

\[
\frac{\partial \varepsilon^2}{\partial \beta}
 = 2\left( \beta\, \hat{w}_x^T C_{xx} \hat{w}_x - \hat{w}_x^T C_{xy} \hat{w}_y \right).
\tag{4.40}
\]

Setting the derivative equal to zero gives

\[
\beta = \frac{\hat{w}_x^T C_{xy} \hat{w}_y}{\hat{w}_x^T C_{xx} \hat{w}_x}.
\tag{4.41}
\]

By inserting this expression into equation 4.39 we get

\[
\varepsilon^2 = E[y^T y] - \frac{\left(\hat{w}_x^T C_{xy} \hat{w}_y\right)^2}{\hat{w}_x^T C_{xx} \hat{w}_x}.
\tag{4.42}
\]

Since ε² cannot be negative and the left term is independent of the parameters, we can minimize ε² by maximizing the quotient to the right in equation 4.42, i.e. maximizing the quotient

\[
\rho = \frac{\hat{w}_x^T C_{xy} \hat{w}_y}{\sqrt{\hat{w}_x^T C_{xx} \hat{w}_x}}
     = \frac{w_x^T C_{xy} w_y}{\sqrt{w_x^T C_{xx} w_x \; w_y^T w_y}}.
\tag{4.43}
\]

Note that if w_x and w_y minimize ε², the negation of one or both of these vectors will give the same minimum. Hence, it is sufficient to maximize the positive root. Like in the two previous cases, this function cannot be written as a Rayleigh quotient, but its critical points coincide with the critical points of a Rayleigh quotient with proper choices of A and B.
The partial derivatives of ρ with respect to w_x and w_y are (see proof B.3.10 on page 163)

\[
\begin{cases}
\dfrac{\partial\rho}{\partial w_x} = \dfrac{a}{\|w_x\|}\left( C_{xy}\hat{w}_y - \beta\, C_{xx}\hat{w}_x \right) \\[2ex]
\dfrac{\partial\rho}{\partial w_y} = \dfrac{a}{\|w_y\|}\left( C_{yx}\hat{w}_x - \dfrac{\rho^2}{\beta}\, \hat{w}_y \right).
\end{cases}
\tag{4.44}
\]

Setting the derivatives to zero gives the equation system

\[
\begin{cases}
C_{xy}\hat{w}_y = \beta\, C_{xx}\hat{w}_x \\
C_{yx}\hat{w}_x = \dfrac{\rho^2}{\beta}\, \hat{w}_y,
\end{cases}
\tag{4.45}
\]

which gives

\[
\begin{cases}
C_{xx}^{-1} C_{xy} C_{yx}\, \hat{w}_x = \rho^2 \hat{w}_x \\
C_{yx} C_{xx}^{-1} C_{xy}\, \hat{w}_y = \rho^2 \hat{w}_y.
\end{cases}
\tag{4.46}
\]

Now, if we let

\[
A = \begin{pmatrix} 0 & C_{xy} \\ C_{yx} & 0 \end{pmatrix}, \quad
B = \begin{pmatrix} C_{xx} & 0 \\ 0 & I \end{pmatrix} \quad\text{and}\quad
w = \begin{pmatrix} w_x \\ w_y \end{pmatrix} = \begin{pmatrix} \mu_x \hat{w}_x \\ \mu_y \hat{w}_y \end{pmatrix},
\tag{4.47}
\]

equation 4.4 can be written as

\[
\begin{cases}
C_{xy}\hat{w}_y = r\,\frac{\mu_x}{\mu_y}\, C_{xx}\hat{w}_x \\
C_{yx}\hat{w}_x = r\,\frac{\mu_y}{\mu_x}\, \hat{w}_y,
\end{cases}
\tag{4.48}
\]

which is recognized as equation 4.45 for β = r μ_x/μ_y and ρ²/β = r μ_y/μ_x. Solving equation 4.48 for w_x and w_y gives equation 4.46 with r² = ρ². This shows that the minimum square error in equation 4.39 is found as a result of maximizing the Rayleigh quotient in equation 4.2 on page 61 for the proper choice of matrices A and B and regression coefficient β.

So far, the first pair of eigenvectors w_{x1} and w_{y1}, i.e. a rank-one solution, has been discussed. Intuitively, a rank-N regression can be expected to be optimal (in a mean square error sense) if the N first pairs of eigenvectors are used, i.e.

\[
\varepsilon^2 = E\!\left[ \Big\| y - \sum_{i=1}^{N} \beta_i\, \hat{w}_{xi}^T x\, \hat{w}_{yi} \Big\|^2 \right]
\tag{4.49}
\]

is minimized if w_{xi} and w_{yi} are the solutions to equation 4.46 corresponding to the N largest eigenvalues. To see that this really is the case, note that the eigenvectors w_{yi} in Y are orthogonal, since C_{yx}C_{xx}^{-1}C_{xy} in equation 4.46 is symmetric. The orthogonality of the w_y s is explained by the Cartesian separability of the square error; when the error in one direction is minimized, no more can be done in that direction to reduce the error. This means that the minimization of ε² in equation 4.49 can be seen as N separate problems that can be solved consecutively, beginning with the first solution that minimizes equation 4.39. When the first solution is found, the next solution can be searched for in the subspace orthogonal to w_{y1}.
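The rank-one regression of equations 4.41 and 4.46 can be sketched numerically. The following NumPy illustration is not from the thesis; the data, names and sizes are invented, and a well-conditioned C_{xx} is assumed:

```python
import numpy as np

def mlr_rank_one(X, Y):
    """Rank-one MLR directions from the eigenproblem of equation 4.46.

    Returns (beta, wx_hat, wy_hat) minimizing E||y - beta wx^T x wy||^2.
    """
    n = X.shape[0]
    Cxx, Cxy = X.T @ X / n, X.T @ Y / n
    M = np.linalg.solve(Cxx, Cxy) @ Cxy.T        # Cxx^{-1} Cxy Cyx
    vals, vecs = np.linalg.eig(M)
    wx = vecs.real[:, np.argmax(vals.real)]
    wx /= np.linalg.norm(wx)
    wy = Cxy.T @ wx                              # wy proportional to Cyx wx
    wy /= np.linalg.norm(wy)
    beta = (wx @ Cxy @ wy) / (wx @ Cxx @ wx)     # equation 4.41
    return beta, wx, wy

# Data with an exact rank-one relation y = 2 (a^T x) b plus noise.
rng = np.random.default_rng(8)
X = rng.standard_normal((5000, 3))
a = np.array([1.0, 0.0, 0.0])
b = np.array([0.0, 1.0])
Y = 2.0 * np.outer(X @ a, b) + 0.1 * rng.standard_normal((5000, 2))
beta, wx, wy = mlr_rank_one(X, Y)
```

The recovered directions line up with a and b (up to sign) and β recovers the planted coefficient 2.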
Now, since {w_{yi}} is orthogonal, the next solution is the second pair of eigenvectors, and so on. Since {w_{yi}} is orthogonal, the solutions are not unique; any set of vectors spanning the same subspace in Y can be used to minimize ε² in equation 4.49, but, of course, with other w_{xi} s and βs. If all solutions to the eigenproblem in equation 4.46 and the corresponding β_i s are used, a solution for multivariate linear regression (MLR), also known as the Wiener filter, is obtained. The mean square sum of the eigenvalues, i.e.

\[
\frac{\sum_i \rho_i^2}{\dim(Y)} = \frac{\operatorname{tr}\!\left( C_{yx} C_{xx}^{-1} C_{xy} \right)}{\dim(Y)},
\]

is known as the redundancy index (Stewart and Love, 1968).

It should be noted that the regression coefficient β defined in equation 4.41 is valid for any choice of ŵ_x and ŵ_y. In particular, if we use the directions of maximum variance, β is the regression coefficient for principal components regression (PCR). For the directions of maximum covariance, β is the regression coefficient for PLS regression.

4.6 Comparisons between PCA, PLS, CCA and MLR

The similarities and differences between the four methods can be seen by comparing the matrices A and B in the generalized eigenproblem (equation 4.1 on page 61). The matrices are listed in table 4.1.

          A                                   B
  PCA:    C_{xx}                              I
  PLS:    [ 0  C_{xy} ; C_{yx}  0 ]           [ I  0 ; 0  I ]
  CCA:    [ 0  C_{xy} ; C_{yx}  0 ]           [ C_{xx}  0 ; 0  C_{yy} ]
  MLR:    [ 0  C_{xy} ; C_{yx}  0 ]           [ C_{xx}  0 ; 0  I ]

Table 4.1: The matrices A and B for PCA, PLS, CCA and MLR.

MLR differs from the other three problems in that it is formulated as a mean square error problem, while the other three methods are formulated as maximisation problems. Reduced-rank multivariate linear regression can, for example, be used to increase the stability of the predictors when there are more parameters than observations, when the relation is known to be of low rank or, maybe most importantly, when a full-rank solution is unobtainable due to computational costs.
The regression coefficients can of course also be used for regression in the first three cases. In the case of PCA, the idea is to separately reduce the dimensionality of the X and Y spaces and to do a regression of the first principal components of Y on the first principal components of X. This method is known as principal components regression. The obvious disadvantage here is that there is no reason to believe that the principal components of X are related to the principal components of Y. To avoid this problem, PLS regression is sometimes used. Clearly, this choice of basis is better than PCA for regression purposes, since directions of high covariance are selected, which means that a linear relation is easier to find. However, neither of these solutions results in the minimum least squares error. This is only obtained using the directions corresponding to the MLR problem.

It is not only MLR that can be formulated as a mean square error problem. van der Burg (1988) formulated CCA as a mean square error minimization problem: minimize

\[
\varepsilon^2 = E\!\left[ \sum_{i=1}^{N} \left( x^T \hat{w}_{xi} - y^T \hat{w}_{yi} \right)^2 \right],
\tag{4.50}
\]

where N is the rank of the solution. In this way, CCA can be seen as a supervised learning method, as discussed in section 2.6.

PCA differs from the other three methods in that it only concerns one set of variables, while the other three concern relations between two sets of variables. The difference between PLS, CCA and MLR can be seen by comparing the matrices in the corresponding eigenproblems (see table 4.1). In CCA, the between-sets covariance matrices are normalized with respect to the within-set covariances in both the x and the y spaces. In MLR, the normalization is done only with respect to the x space covariance, while the y space, where the square error is defined, is left unchanged. In PLS, no normalization is done.
Hence, these three cases can be seen as the same problem, covariance maximization, where the variables have been subjected to different, data-dependent, scalings. The main difference between CCA and the other three methods is that CCA is closely related to mutual information, as described in section 4.4.1, and can hence easily be motivated in information-theoretical terms. Because of this relation, it is a bit surprising that canonical correlation seems to be rather unknown in the signal processing, learning and neural network societies. As an example, a search for "neural network(s)" together with "canonical correlation(s)" in the SciSearch Database of the Institute for Scientific Information, Philadelphia, gave 3 hits. A corresponding search for "partial least square(s)" gave 103 hits, for "linear regression" 212 hits and for "principal component(s)" 287 hits.⁷ The same test with "signal processing" instead of "neural networks" gave 2, 5, 18 and 31 hits respectively. This result does not, of course, mean that all articles that matched "principal component(s)" presented learning methods based on PCA. But it may indicate the difference in interest in, or awareness of, the different methods within these fields of research.

To see how these four different special cases of the generalized eigenproblem may differ, the solutions for the same data are plotted in figure 4.3. The data are two-dimensional in X and Y and randomly distributed with zero mean. The top row shows the eigenvectors in X for CCA, MLR, PLS and PCA respectively. The bottom row shows the solutions in Y. Note that all solutions except the two solutions for CCA and the X-solution for MLR are orthogonal.

Figure 4.3: Examples of eigenvectors using CCA, MLR, PLS and PCA on the same sets of data.
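The comparison can also be made numerically. The sketch below (invented 2+2-dimensional data; not from the thesis) computes the first pair of directions for each method by standard eigendecompositions and evaluates the four criteria of figure 4.4 — each method should win its own criterion:

```python
import numpy as np

rng = np.random.default_rng(7)
Z = rng.standard_normal((4, 4))
C = Z @ Z.T / 4 + np.eye(4)          # a valid joint covariance of (x, y)
Cxx, Cxy, Cyy = C[:2, :2], C[:2, 2:], C[2:, 2:]

def unit(v):
    return v / np.linalg.norm(v)

def top_eigvec(M):
    vals, vecs = np.linalg.eig(M)
    return unit(vecs.real[:, np.argmax(vals.real)])

wx = {'cca': top_eigvec(np.linalg.solve(Cxx, Cxy) @ np.linalg.solve(Cyy, Cxy.T)),
      'mlr': top_eigvec(np.linalg.solve(Cxx, Cxy) @ Cxy.T),
      'pls': top_eigvec(Cxy @ Cxy.T),
      'pca': top_eigvec(Cxx)}
wy = {'cca': unit(np.linalg.solve(Cyy, Cxy.T @ wx['cca'])),
      'mlr': unit(Cxy.T @ wx['mlr']),
      'pls': unit(Cxy.T @ wx['pls']),
      'pca': top_eigvec(Cyy)}

def corr(m):
    a, b = wx[m], wy[m]
    return abs(a @ Cxy @ b) / np.sqrt((a @ Cxx @ a) * (b @ Cyy @ b))

def mse(m):   # the direction-dependent part of equation 4.42
    a, b = wx[m], wy[m]
    return -(a @ Cxy @ b) ** 2 / (a @ Cxx @ a)

def cov(m):
    return abs(wx[m] @ Cxy @ wy[m])

def var(m):
    return wx[m] @ Cxx @ wx[m] + wy[m] @ Cyy @ wy[m]

methods = ['cca', 'mlr', 'pls', 'pca']
best = {'corr': max(methods, key=corr), 'mse': min(methods, key=mse),
        'cov': max(methods, key=cov), 'var': max(methods, key=var)}
```

By the optimality of each method for its own criterion, `best` comes out as CCA for correlation, MLR for square error, PLS for covariance and PCA for variance, mirroring figure 4.4.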
⁷The search was made on November 4, 1997, through the Norwegian BIBSYS library system (http://www.bibsys.no). The "free text" field was used, which performs a search in the title, abstract and keywords.

Figure 4.4 shows the correlation, mean square error, covariance and variance of the data projected onto the first eigenvectors for each method. The figure shows that the correlation is maximized for the CCA solution, the mean square error is minimized for the MLR solution, the covariance is maximized for the PLS solution, and the variance is maximized for the PCA solution.

Figure 4.4: The correlation, mean square error, covariance and variance when using the first pair of vectors for each method. The correlation is maximized for the CCA solution. The mean square error is minimized for the MLR solution. The covariance is maximized for the PLS solution. The variance is maximized for the PCA solution. (See section 4.6.)

4.7 Gradient search on the Rayleigh quotient

This section shows that the solutions to the generalized eigenproblem can be found and, hence, that PCA, PLS, CCA or MLR can be performed, by a gradient search on the Rayleigh quotient.

Finding the largest eigenvalue

The previous section showed that the only stable critical point of the Rayleigh quotient is the global maximum (equation 4.9 on page 63). This means that it should be possible to find the largest eigenvalue of the generalized eigenproblem and its corresponding eigenvector by performing a gradient search on the Rayleigh quotient (equation 4.2 on page 61).
This can be done by using the iterative algorithm

\[
w(t+1) = w(t) + \Delta w(t),
\tag{4.51}
\]

where the update vector Δw, on average, lies in the direction of the gradient:

\[
E[\Delta w] = \beta \frac{\partial r}{\partial w} = \alpha\left( A\hat{w} - rB\hat{w} \right),
\tag{4.52}
\]

where α and β are positive numbers. α is the gain controlling how far, in the direction of the gradient, the vector estimate is updated at each iteration. This gain could be constant as well as data or time dependent, as discussed in section 2.3.2.

In all four cases treated here, A has at least one positive eigenvalue, i.e. there exists an r > 0. An update rule such that

\[
E[\Delta w] = \alpha\left( A\hat{w} - Bw \right)
\tag{4.53}
\]

can then be used to find the positive eigenvalues. Here, the length of the vector represents the corresponding eigenvalue, i.e. ‖w‖ = r. To see this, consider a choice of w that gives r < 0. Then w^T Δw < 0, since w^T Aŵ < 0 and w^T Bw > 0. This means that ‖w‖ will decrease until r becomes positive. The function Aŵ − Bw is illustrated in figure 4.5, together with the Rayleigh quotient plotted to the left in figure 4.1 on page 62.

Figure 4.5: The function Aŵ − Bw, for the same matrices A and B as in figure 4.1, plotted for different w. The Rayleigh quotient is plotted as reference.

Finding successive eigenvalues

Since the learning rule defined in equation 4.52 maximizes the Rayleigh quotient in equation 4.2 on page 61, it will find the largest eigenvalue λ₁ and a corresponding eigenvector ŵ₁ = ê₁ of equation 4.1 on page 61. The question that naturally arises is if, and how, the algorithm can be modified to find the successive eigenvalues and eigenvectors, i.e. the successive solutions to the eigenvalue equation 4.1.

Let G denote the n × n matrix B⁻¹A. Then the n equations for the n eigenvalues solving the eigenproblem in equation 4.1 on page 61 can be written as

\[
GE = ED \quad\Rightarrow\quad G = EDE^{-1} = \sum_i \lambda_i\, \hat{e}_i f_i^T,
\tag{4.54}
\]

where the eigenvalues and eigenvectors constitute the matrices D and E respectively:

\[
D = \begin{pmatrix} \lambda_1 & & 0 \\ & \ddots & \\ 0 & & \lambda_n \end{pmatrix}, \quad
E = \begin{pmatrix} \hat{e}_1 & \cdots & \hat{e}_n \end{pmatrix}, \quad
E^{-1} = \begin{pmatrix} f_1^T \\ \vdots \\ f_n^T \end{pmatrix}.
\tag{4.55}
\]

The vectors f_i, appearing in the rows of the inverse of the matrix containing the eigenvectors, are the dual vectors of the eigenvectors ê_i, which means that

\[
f_i^T \hat{e}_j = \delta_{ij}.
\tag{4.56}
\]

{f_i} are also called the left eigenvectors of G, and {ê_i} and {f_i} are said to be biorthogonal. Remember (from equation 4.6 on page 63) that the eigenvectors ê_i are both A- and B-orthogonal, i.e.

\[
\hat{e}_i^T A \hat{e}_j = 0 \quad\text{and}\quad \hat{e}_i^T B \hat{e}_j = 0 \quad\text{for } i \neq j.
\tag{4.57}
\]

Hence, the dual vectors f_i possessing the property in equation 4.56 can be found by choosing them according to

\[
f_i = \frac{B \hat{e}_i}{\hat{e}_i^T B \hat{e}_i}.
\tag{4.58}
\]

Now, if ê₁ is the eigenvector corresponding to the largest eigenvalue of G, the new matrix

\[
H = G - \lambda_1 \hat{e}_1 f_1^T
\tag{4.59}
\]

has the same eigenvectors and eigenvalues as G except for the eigenvalue corresponding to ê₁, which now becomes 0 (see proof B.3.11 on page 164). This means that the eigenvector corresponding to the largest eigenvalue of H is the same as the one corresponding to the second largest eigenvalue of G. Since the algorithm starts by finding the vector ŵ₁ = ê₁, it is only necessary to estimate the dual vector f₁ in order to subtract the correct outer product from G and remove its largest eigenvalue. In our case, this is a little bit tricky since G is not generated directly. Instead, its two components A and B must be modified in order to produce the desired subtraction.
Hence, we want two modified components, A′ and B′, with the following property:

\[
B'^{-1} A' = B^{-1} A - \lambda_1 \hat{e}_1 f_1^T.
\tag{4.60}
\]

A simple solution is obtained if only one of the matrices is modified and the other matrix is kept fixed:

\[
B' = B \quad\text{and}\quad A' = A - \lambda_1 B \hat{e}_1 f_1^T.
\tag{4.61}
\]

This modification can be accomplished by estimating a vector u₁ = λ₁Bê₁ = Bw₁ iteratively as

\[
u_1(t+1) = u_1(t) + \Delta u_1(t),
\tag{4.62}
\]

where

\[
E[\Delta u_1] = \alpha\left( rB\hat{w}_1 - u_1 \right).
\tag{4.63}
\]

Once this estimate has converged, u₁ = λ₁Bê₁ can be used to express the outer product in equation 4.61:

\[
\lambda_1 B \hat{e}_1 f_1^T
 = \frac{\lambda_1 B \hat{e}_1 \hat{e}_1^T B^T}{\hat{e}_1^T B \hat{e}_1}
 = \frac{u_1 u_1^T}{\hat{w}_1^T u_1}.
\tag{4.64}
\]

Now A′ can be estimated and, hence, a modified version of the learning algorithm in equation 4.52 which finds the second eigenvalue and the corresponding eigenvector of the generalized eigenproblem is obtained:

\[
E[\Delta w] = \alpha\left( A'\hat{w} - rB\hat{w} \right)
 = \alpha\left( \left( A - \frac{u_1 u_1^T}{\hat{w}_1^T u_1} \right)\hat{w} - rB\hat{w} \right).
\tag{4.65}
\]

The vector w₁ is the first solution produced by the algorithm, i.e. the largest eigenvalue and the corresponding eigenvector. This scheme can of course be repeated in order to find the third eigenvalue by subtracting the second solution in the same way, and so on. Note that this method does not put any demands on the range of B, in contrast to exact solutions involving matrix inversion. The following four sub-sections show how this iterative algorithm can be applied to the four important problems described in the previous section.
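The scheme above can be sketched deterministically by running the update rules on fixed matrices instead of stochastic samples. The following NumPy illustration (not from the thesis; A, B, gains and iteration counts are invented) finds the largest eigenvalue of B⁻¹A with the rule of equation 4.53, deflates A as in equations 4.61 and 4.64, and then finds the second eigenvalue:

```python
import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0]])    # symmetric
B = np.array([[2.0, 0.5], [0.5, 1.0]])    # symmetric positive definite
alpha = 0.05

def iterate(A_mod, steps=4000):
    """Run w <- w + alpha (A_mod w_hat - B w); at convergence, w_hat is the
    leading generalized eigenvector and ||w|| its eigenvalue."""
    w = np.array([0.3, 0.1])
    for _ in range(steps):
        w_hat = w / np.linalg.norm(w)
        w = w + alpha * (A_mod @ w_hat - B @ w)
    return w

w1 = iterate(A)                              # ||w1|| -> largest eigenvalue
u1 = B @ w1                                  # converged value of eq. 4.63
w1_hat = w1 / np.linalg.norm(w1)
A2 = A - np.outer(u1, u1) / (w1_hat @ u1)    # deflation, eqs. 4.61 and 4.64
w2 = iterate(A2)                             # ||w2|| -> second eigenvalue

eigs = np.sort(np.linalg.eigvals(np.linalg.solve(B, A)).real)[::-1]
```

The two vector norms agree with the two generalized eigenvalues obtained by a direct solution, which is what the deflation argument predicts.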
4.7.1 PCA

Finding the largest principal component

The direction of maximum data variation can be found by a stochastic gradient search according to equation 4.53, with A and B defined according to equation 4.12:

\[
A = C_{xx} \quad\text{and}\quad B = I.
\tag{4.12}
\]

This leads to an unsupervised Hebbian learning algorithm that finds both the direction of maximum data variation and the variance of the data in that direction:

\[
E[\Delta w] = \gamma \frac{\partial \rho}{\partial w}
 = \alpha\left( C_{xx}\hat{w} - w \right)
 = \alpha\, E\!\left[ x x^T \hat{w} - w \right].
\tag{4.66}
\]

The update rule for this algorithm is given by

\[
\Delta w = \alpha\left( x x^T \hat{w} - w \right),
\tag{4.67}
\]

where the length of the vector represents the estimated variance, i.e. ‖w‖ = ρ. (Note that ρ in this case is always positive.) Note that this algorithm finds both the direction of maximum data variation and how much the data vary along that direction. Algorithms for PCA often find only the direction of maximum data variation; if one is also interested in the variation along this direction, another algorithm needs to be employed. This is the case for the well-known PCA algorithm presented by Oja and Karhunen (1985).

Finding successive principal components

In order to find successive principal components, recall that A = C_{xx} and B = I. Hence, the matrix G = B⁻¹A = C_{xx} is symmetric and has orthogonal eigenvectors. This means that the dual vectors and the eigenvectors become indistinguishable and that no vector other than w itself needs to be estimated. The outer product in equation 4.61 then becomes

\[
\lambda_1 B \hat{e}_1 f_1^T = \lambda_1 I \hat{e}_1 \hat{e}_1^T = w_1 \hat{w}_1^T.
\tag{4.68}
\]

This means that the modified learning rule for finding the second eigenvalue can be written as

\[
E[\Delta w] = \alpha\left( A'\hat{w} - Bw \right)
 = \alpha\left( \left( C_{xx} - w_1 \hat{w}_1^T \right)\hat{w} - w \right).
\tag{4.69}
\]

A stochastic approximation of this rule is achieved if the vector w is updated by

\[
\Delta w = \alpha\left( \left( x x^T - w_1 \hat{w}_1^T \right)\hat{w} - w \right)
\tag{4.70}
\]

at each time step. As mentioned in section 4.2, it is possible to perform a PCA on the inverse of the covariance matrix by choosing A = I and B = C_{xx}.
The learning rule associated with this choice then becomes

\[
\Delta w = \alpha\left( \hat{w} - x x^T w \right).
\tag{4.71}
\]

4.7.2 PLS

Finding the largest singular value

If the aim is to find the directions of maximum data covariance, the matrices A and B are defined according to equation 4.19:

\[
A = \begin{pmatrix} 0 & C_{xy} \\ C_{yx} & 0 \end{pmatrix}, \quad
B = \begin{pmatrix} I & 0 \\ 0 & I \end{pmatrix} \quad\text{and}\quad
w = \begin{pmatrix} \mu_x \hat{w}_x \\ \mu_y \hat{w}_y \end{pmatrix}.
\tag{4.19}
\]

Since w on average should be updated in the direction of the gradient, the update rule in equation 4.53 gives

\[
E[\Delta w] = \gamma \frac{\partial r}{\partial w}
 = \alpha\left( \begin{pmatrix} 0 & C_{xy} \\ C_{yx} & 0 \end{pmatrix}\hat{w}
 - r \begin{pmatrix} I & 0 \\ 0 & I \end{pmatrix}\hat{w} \right).
\tag{4.72}
\]

This behaviour is accomplished if, at each time step, the vector w is updated with

\[
\Delta w = \alpha\left( \begin{pmatrix} 0 & x y^T \\ y x^T & 0 \end{pmatrix}\hat{w} - w \right),
\tag{4.73}
\]

where the length of the vector at convergence represents the covariance, i.e. ‖w‖ = r = ρ. This can be done since it is sufficient to search for positive values of ρ.

Finding successive singular values

Also in this case, the special structure of the A and B matrices simplifies the procedure of finding the subsequent directions of maximum data covariance. The compound matrix G = B⁻¹A = A is symmetric and has orthogonal eigenvectors, which are identical to their dual vectors. The outer product for the modification of the matrix A in equation 4.61 is identical to the one presented in the previous section:

\[
\lambda_1 B \hat{e}_1 f_1^T = \lambda_1 \begin{pmatrix} I & 0 \\ 0 & I \end{pmatrix} \hat{e}_1 \hat{e}_1^T = w_1 \hat{w}_1^T.
\tag{4.74}
\]

A modified learning rule for finding the second eigenvalue can thus be written as

\[
E[\Delta w] = \alpha\left( A'\hat{w} - Bw \right)
 = \alpha\left( \left( \begin{pmatrix} 0 & C_{xy} \\ C_{yx} & 0 \end{pmatrix} - w_1 \hat{w}_1^T \right)\hat{w} - w \right).
\tag{4.75}
\]

A stochastic approximation of this rule is achieved if the vector w is updated at each time step by

\[
\Delta w = \alpha\left( \left( \begin{pmatrix} 0 & x y^T \\ y x^T & 0 \end{pmatrix} - w_1 \hat{w}_1^T \right)\hat{w} - w \right).
\tag{4.76}
\]

4.7.3 CCA

Finding the largest canonical correlation

Again, the algorithm in equation 4.53 for solving the generalized eigenproblem can be used for the stochastic gradient search.
With the matrices A and B and the vector w as in equation 4.29:

\[
A = \begin{pmatrix} 0 & C_{xy} \\ C_{yx} & 0 \end{pmatrix}, \quad
B = \begin{pmatrix} C_{xx} & 0 \\ 0 & C_{yy} \end{pmatrix} \quad\text{and}\quad
w = \begin{pmatrix} w_x \\ w_y \end{pmatrix} = \begin{pmatrix} \mu_x \hat{w}_x \\ \mu_y \hat{w}_y \end{pmatrix},
\tag{4.29}
\]

the update direction is

\[
E[\Delta w] = \gamma \frac{\partial r}{\partial w}
 = \alpha\left( \begin{pmatrix} 0 & C_{xy} \\ C_{yx} & 0 \end{pmatrix}\hat{w}
 - r \begin{pmatrix} C_{xx} & 0 \\ 0 & C_{yy} \end{pmatrix}\hat{w} \right).
\tag{4.77}
\]

This behaviour is accomplished if, at each time step, the vector w is updated with

\[
\Delta w = \alpha\left( \begin{pmatrix} 0 & x y^T \\ y x^T & 0 \end{pmatrix}\hat{w}
 - \begin{pmatrix} x x^T & 0 \\ 0 & y y^T \end{pmatrix} w \right).
\tag{4.78}
\]

Since ‖w‖ = r = ρ when the algorithm converges, the length of the vector represents the correlation between the variates.

Finding successive canonical correlations

In the two previous cases, it was easy to cancel out an eigenvalue because the matrix G was symmetric. This is not the case for canonical correlation. In this case,

\[
G = B^{-1}A
 = \begin{pmatrix} C_{xx}^{-1} & 0 \\ 0 & C_{yy}^{-1} \end{pmatrix}
   \begin{pmatrix} 0 & C_{xy} \\ C_{yx} & 0 \end{pmatrix}
 = \begin{pmatrix} 0 & C_{xx}^{-1} C_{xy} \\ C_{yy}^{-1} C_{yx} & 0 \end{pmatrix}.
\tag{4.79}
\]

Because of this, it is necessary to estimate the dual vector f₁ corresponding to the eigenvector ê₁, or rather the vector u₁ = λ₁Bê₁, as described in equation 4.63:

\[
E[\Delta u_1] = \alpha\left( B w_1 - u_1 \right)
 = \alpha\left( \begin{pmatrix} C_{xx} & 0 \\ 0 & C_{yy} \end{pmatrix} w_1 - u_1 \right).
\tag{4.80}
\]

A stochastic approximation of this rule is given by

\[
\Delta u_1 = \alpha\left( \begin{pmatrix} x x^T & 0 \\ 0 & y y^T \end{pmatrix} w_1 - u_1 \right).
\tag{4.81}
\]

With this estimate, the outer product in equation 4.61 can be used to modify the matrix A:

\[
A' = A - \lambda_1 B \hat{e}_1 f_1^T = A - \frac{u_1 u_1^T}{\hat{w}_1^T u_1}.
\tag{4.82}
\]

A modified version of the learning algorithm in equation 4.78 which finds the second largest canonical correlation and its corresponding directions can be written on the following form:

\[
E[\Delta w] = \alpha\left( A'\hat{w} - Bw \right)
 = \alpha\left( \left( \begin{pmatrix} 0 & C_{xy} \\ C_{yx} & 0 \end{pmatrix}
 - \frac{u_1 u_1^T}{\hat{w}_1^T u_1} \right)\hat{w}
 - \begin{pmatrix} C_{xx} & 0 \\ 0 & C_{yy} \end{pmatrix} w \right).
\tag{4.83}
\]

Again, to get a stochastic approximation of this rule, the update at each time step is performed according to

\[
\Delta w = \alpha\left( \left( \begin{pmatrix} 0 & x y^T \\ y x^T & 0 \end{pmatrix}
 - \frac{u_1 u_1^T}{\hat{w}_1^T u_1} \right)\hat{w}
 - \begin{pmatrix} x x^T & 0 \\ 0 & y y^T \end{pmatrix} w \right).
\tag{4.84}
\]

Note that this algorithm simultaneously finds both the directions of canonical correlation and the canonical correlations ρ_i, in contrast to the algorithm proposed by Kay (1992), which only finds the directions.
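The stochastic rule of equation 4.78 can be sketched directly in NumPy. The example below (invented toy distribution, gains and iteration counts; not the thesis implementation, which adds adaptive gains and averaging) uses 2-dimensional x and y sharing one common source, so the first canonical correlation is 0.8:

```python
import numpy as np

# Stochastic CCA rule of equation 4.78: one sample per iteration; at
# convergence ||w|| estimates the first canonical correlation.
rng = np.random.default_rng(5)
alpha = 0.001
w = 0.1 * np.ones(4)                       # w = (w_x, w_y), 2 + 2 dims
for _ in range(100_000):
    s = rng.standard_normal()              # common source
    x = np.array([s + 0.5 * rng.standard_normal(), rng.standard_normal()])
    y = np.array([s + 0.5 * rng.standard_normal(), rng.standard_normal()])
    w_hat = w / np.linalg.norm(w)
    wx_h, wy_h = w_hat[:2], w_hat[2:]
    wx, wy = w[:2], w[2:]
    # Delta w = alpha([[0, x y^T],[y x^T, 0]] w_hat - [[x x^T,0],[0,y y^T]] w)
    dw = np.concatenate([x * (y @ wy_h) - x * (x @ wx),
                         y * (x @ wx_h) - y * (y @ wy)])
    w = w + alpha * dw

rho_est = np.linalg.norm(w)
```

With cov(x₀, y₀) = 1 and both variances 1.25, the true first canonical correlation is 1/1.25 = 0.8, and the vector length fluctuates around that value.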
4.7.4 MLR

Finding the directions for minimum square error

Also here, the algorithm in equation 4.53 can be used for a stochastic gradient search. With A, B and w according to equation 4.47:

\[
A = \begin{pmatrix} 0 & C_{xy} \\ C_{yx} & 0 \end{pmatrix}, \quad
B = \begin{pmatrix} C_{xx} & 0 \\ 0 & I \end{pmatrix} \quad\text{and}\quad
w = \begin{pmatrix} w_x \\ w_y \end{pmatrix} = \begin{pmatrix} \mu_x \hat{w}_x \\ \mu_y \hat{w}_y \end{pmatrix},
\tag{4.47}
\]

the update direction is

\[
E[\Delta w] = \gamma \frac{\partial r}{\partial w}
 = \alpha\left( \begin{pmatrix} 0 & C_{xy} \\ C_{yx} & 0 \end{pmatrix}\hat{w}
 - r \begin{pmatrix} C_{xx} & 0 \\ 0 & I \end{pmatrix}\hat{w} \right).
\tag{4.85}
\]

This behaviour is accomplished if the vector w at each time step is updated with

\[
\Delta w = \alpha\left( \begin{pmatrix} 0 & x y^T \\ y x^T & 0 \end{pmatrix}\hat{w}
 - \begin{pmatrix} x x^T & 0 \\ 0 & I \end{pmatrix} w \right).
\tag{4.86}
\]

Since ‖w‖ = r = ρ when the algorithm converges, the regression coefficient is obtained as β = ‖w‖ μ_x/μ_y.

Finding successive directions for minimum square error

Also in this case, the dual vectors must be used to cancel out the detected eigenvalues. The non-symmetric matrix G is

\[
G = B^{-1}A
 = \begin{pmatrix} C_{xx}^{-1} & 0 \\ 0 & I \end{pmatrix}
   \begin{pmatrix} 0 & C_{xy} \\ C_{yx} & 0 \end{pmatrix}
 = \begin{pmatrix} 0 & C_{xx}^{-1} C_{xy} \\ C_{yx} & 0 \end{pmatrix}.
\tag{4.87}
\]

Again, the vector u₁ = λ₁Bê₁ is estimated as described in equation 4.63:

\[
E[\Delta u_1] = \alpha\left( B w_1 - u_1 \right)
 = \alpha\left( \begin{pmatrix} C_{xx} & 0 \\ 0 & I \end{pmatrix} w_1 - u_1 \right).
\tag{4.88}
\]

A stochastic approximation of this rule is given by

\[
\Delta u_1 = \alpha\left( \begin{pmatrix} x x^T & 0 \\ 0 & I \end{pmatrix} w_1 - u_1 \right).
\tag{4.89}
\]

With this estimate, the outer product in equation 4.61 can be used to modify the matrix A:

\[
A' = A - \lambda_1 B \hat{e}_1 f_1^T = A - \frac{u_1 u_1^T}{\hat{w}_1^T u_1}.
\tag{4.90}
\]

A modified version of the learning algorithm in equation 4.86 which finds the successive directions of minimum square error and their corresponding regression coefficients can be written on the following form:

\[
E[\Delta w] = \alpha\left( A'\hat{w} - Bw \right)
 = \alpha\left( \left( \begin{pmatrix} 0 & C_{xy} \\ C_{yx} & 0 \end{pmatrix}
 - \frac{u_1 u_1^T}{\hat{w}_1^T u_1} \right)\hat{w}
 - \begin{pmatrix} C_{xx} & 0 \\ 0 & I \end{pmatrix} w \right).
\tag{4.91}
\]

Again, to get a stochastic approximation of this rule, the update at each time step is performed according to

\[
\Delta w = \alpha\left( \left( \begin{pmatrix} 0 & x y^T \\ y x^T & 0 \end{pmatrix}
 - \frac{u_1 u_1^T}{\hat{w}_1^T u_1} \right)\hat{w}
 - \begin{pmatrix} x x^T & 0 \\ 0 & I \end{pmatrix} w \right).
\tag{4.92}
\]

As mentioned earlier, the w_y s are orthogonal in this case. This means that this method can be used for successively building up a low-rank approximation of the MLR by adding a sufficient number of solutions, i.e.

\[
\tilde{y} = \sum_{i=1}^{N} \beta_i\, x^T \hat{w}_{xi}\, \hat{w}_{yi},
\tag{4.93}
\]

where ỹ is the estimated y and N is the rank.
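Equation 4.93 can be checked deterministically: summing all dim(Y) rank-one solutions of equation 4.46, each with its β from equation 4.41, reproduces the full Wiener solution C_{xx}⁻¹C_{xy}. A NumPy sketch (invented data and names; not from the thesis):

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.standard_normal((5000, 3))
W_true = rng.standard_normal((3, 2))
Y = X @ W_true + 0.1 * rng.standard_normal((5000, 2))

n = X.shape[0]
Cxx, Cxy = X.T @ X / n, X.T @ Y / n
M = np.linalg.solve(Cxx, Cxy) @ Cxy.T        # Cxx^{-1} Cxy Cyx, eq. 4.46
vals, vecs = np.linalg.eig(M)
order = np.argsort(-vals.real)

W_lowrank = np.zeros((3, 2))
for i in order[:2]:                          # N = dim(Y) = 2: full-rank MLR
    wx = vecs.real[:, i] / np.linalg.norm(vecs.real[:, i])
    wy = Cxy.T @ wx                          # wy proportional to Cyx wx
    wy /= np.linalg.norm(wy)
    beta = (wx @ Cxy @ wy) / (wx @ Cxx @ wx)  # equation 4.41
    W_lowrank += beta * np.outer(wx, wy)     # rank-one term of eq. 4.93

W_wiener = np.linalg.solve(Cxx, Cxy)         # full MLR (Wiener) solution
```

The accumulated low-rank sum matches the Wiener solution to numerical precision, and both are close to the planted W_true.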
4.8 Experiments

The memory requirement as well as the computational cost per iteration of the presented algorithm is of order O(Nd), where N is the number of estimated models, i.e. the rank of the solution, and d is the dimensionality of the signal. This enables experiments in signal spaces having dimensionalities which would be impossible to handle using traditional techniques involving matrix multiplications (having memory requirements of order O(d²) and computational costs of order O(d³)).

This section presents some experiments using the algorithm for analysis of stochastic processes. First, the algorithm is employed to perform PCA, PLS, CCA and MLR. Here, the dimensionality of the signal space is kept reasonably low in order to make a comparison with the performance of an optimal (in the sense of maximum likelihood (ML)) deterministic solution, which is calculated at each iteration based on the data accumulated so far. In the final experiment, the algorithm is applied to a process in a high-dimensional (1,000 dimensions) signal space. In this case, the update factor is made data dependent and the output from the algorithm is post-filtered in order to meet requirements of quick convergence together with algorithm robustness.

The errors in magnitude and angle were calculated relative to the correct answer w_c. The same error measures were used for the output from the algorithm as well as for the ML estimate:

\[
\varepsilon_m(w) = \|w_c\| - \|w\|
\tag{4.94}
\]
\[
\varepsilon_a(w) = \arccos\!\left( \hat{w}^T \hat{w}_c \right).
\tag{4.95}
\]

4.8.1 Comparisons to optimal solutions

The test data for these four experiments were generated from a 30-dimensional Gaussian distribution such that the eigenvalues of the generalized eigenproblem decreased exponentially from 0.9:

\[
\lambda_i = 0.9 \left( \tfrac{2}{3} \right)^{i-1}.
\]

The two largest eigenvalues (0.9 and 0.6) and the corresponding eigenvectors were simultaneously searched for.
In the PLS, CCA and MLR experiments, the dimensionalities of the signal vectors belonging to the x and y parts of the signal were 20 and 10 respectively. The average angular and magnitude errors were calculated based on 10 different runs. This computation was made at each iteration, both for the algorithm and for the ML solution. The results are plotted in figures 4.6, 4.7, 4.8 and 4.9 for PCA, PLS, CCA and MLR respectively. The errors of the algorithm are drawn with solid lines and the errors of the ML solution are drawn with dotted lines.

Figure 4.6: Results for the PCA case (mean angular and norm errors for w₁ and w₂).
Figure 4.7: Results for the PLS case (mean angular and norm errors for w₁ and w₂).
Figure 4.8: Results for the CCA case (mean angular and norm errors for w₁ and w₂).
Figure 4.9: Results for the MLR case (mean angular and norm errors for w₁ and w₂).
The vertical bars show the standard deviations. Note that the angular error is always positive and, hence, does not have a symmetrical distribution. However, for simplicity, the standard deviation indicators have been placed symmetrically around the mean. The first 30 iterations were omitted to avoid singular matrices when calculating matrix inverses for the ML solutions.

No attempt was made to find an optimal set of parameters for the algorithm. Instead, the experiments and comparisons were carried out only to display the behaviour of the algorithm and to show that it is robust and converges to the correct solutions. Initially, the estimate was assigned a small random vector. A constant gain factor of α = 0.001 was used throughout all four experiments.

4.8.2 Performance in high-dimensional signal spaces

The purpose of the methods discussed in this chapter is dimensionality reduction in high-dimensional signal spaces. We have previously shown that the proposed algorithm has the computational capacity to handle such signals. This experiment illustrates that the algorithm also behaves well in practice for high-dimensional signals. The dimensionality of x is 800 and the dimensionality of y is 200, so the total dimensionality of the signal space is 1,000. The objective in this experiment is CCA.

In the previous experiment, the algorithm was used in its basic form with constant update rates set by hand. In this experiment, however, a more sophisticated version of the algorithm is used, where the update rate is adaptive and the vectors are averaged over time. The details of this extension of the algorithm are numerous and beyond the scope of this thesis; only a brief explanation of the basic structure of the extended algorithm is given here. The algorithm can be described in terms of four blocks, as illustrated in figure 4.10.
The first block, ∆w, calculates the delta vectors according to

    ∆wx = (yT ŵy − xT wx) x
    ∆wy = (xT ŵx − yT wy) y                                        (4.96)

The difference between this update rule and the update rule in 4.78 on page 84 is that here the two delta vectors ∆wx and ∆wy are calculated separately. But the update rule can still be identified as the gradient of ρ in equation 4.25 on page 68 for

    wx = (ŵxT Cxy ŵy)/(ŵxT Cxx ŵx) ŵx   and   wy = (ŵyT Cyx ŵx)/(ŵyT Cyy ŵy) ŵy.

Figure 4.10: The extended CCA algorithm, with blocks ∆w, DCC-SUM, LP and CONS. See the text for explanations.

The delta vectors are then accumulated in the DCC-SUM block in a way that compensates for the influence of the DC component of the sample data:

    wx = (∑ α1x ∆wx)/(∑ α1x) − (∑ α1x x)/(∑ α1x) · (∑ α1x (yT ŵy − xT wx))/(∑ α1x)
    wy = (∑ α1y ∆wy)/(∑ α1y) − (∑ α1y y)/(∑ α1y) · (∑ α1y (xT ŵx − yT wy))/(∑ α1y)    (4.97)

Note that the sums can be accumulated on-line. Also note that the update factor α1 can be different for wx and wy. Finally, the weighted averages wxa and wya of wx and wy respectively are calculated:

    wxa(t + 1) = wxa(t) + α2x (wx(t) − wxa(t))
    wya(t + 1) = wya(t) + α2y (wy(t) − wya(t))                      (4.98)

Adaptability is necessary for a system without a pre-specified (time-dependent) update rate α. Here, the adaptive update rate depends on the consistency of the change of the vector. The consistency is calculated in the CONS block as

    c = ‖∆w̃x‖,                                                     (4.99)

where ∆w̃x is an estimate of the normalized average delta vector:

    ∆w̃x(t + 1) = ∆w̃x(t) + α2x (∆wx/‖∆wx‖ − ∆w̃x(t)).                (4.100)

A similar calculation of c is made for wy. The functions f1 and f2 map the consistency c in a suitable way: f2 increases the sensitivity to changes in c (α2 ∝ c²) and f1 decreases it (α1 ∝ c^(1/2)). When there is a consistent change in w, c is large and the averaging window is short, which makes wa follow w quickly.
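As a rough illustration of the basic update rule (4.96), the following sketch runs it on toy data with one shared signal component. The data model, the gain alpha and the iteration count are illustrative assumptions, not values from the thesis.

```python
import numpy as np

# Toy run of the update rule (4.96). The data model (one shared component
# in the first coordinate), the gain alpha and the iteration count are
# illustrative assumptions.
rng = np.random.default_rng(0)
dx, dy = 4, 3
alpha, n_iter = 0.001, 20000

wx = rng.normal(scale=0.1, size=dx)   # small random start, as in the text
wy = rng.normal(scale=0.1, size=dy)

for _ in range(n_iter):
    s = rng.normal()                  # shared component
    x = rng.normal(size=dx)
    y = rng.normal(size=dy)
    x[0] += s
    y[0] += s
    wxh = wx / np.linalg.norm(wx)     # plays the role of w-hat in (4.96)
    wyh = wy / np.linalg.norm(wy)
    wx += alpha * (y @ wyh - x @ wx) * x   # delta-vector for wx
    wy += alpha * (x @ wxh - y @ wy) * y   # delta-vector for wy

# both vectors should align with the correlated (first) coordinate
cos_x = abs(wx[0]) / np.linalg.norm(wx)
cos_y = abs(wy[0]) / np.linalg.norm(wy)
```

With the shared component placed in the first coordinate, both weight vectors drift towards that axis, which is the first canonical direction of this toy distribution.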
When the changes in w are less consistent, the window gets longer and wa is the average of an increasing number of instances of w. This means, for example, that if w is moving symmetrically around the correct solution with a constant variance, the error of wa will still tend towards zero (see figure 4.11). The experiment was carried out using a randomly chosen distribution of an 800-dimensional x variable and a 200-dimensional y variable. Two x and two y dimensions were correlated; the other 798 dimensions of x and 198 dimensions of y were uncorrelated. The variances in the 1,000 dimensions were of the same order of magnitude. The upper plot in figure 4.11 shows the estimated first canonical correlation as a function of the number of iterations (solid line) and the true correlation in the current directions found by the algorithm (dotted line). Note that each iteration gives one sample. The lower plot in figure 4.11 shows the effect of the adaptive averaging. The two upper noisy curves show the logarithms of the angular errors of the ‘raw’ estimates wx and wy, and the two lower curves show the angular errors for wxa (dashed) and wya (solid). The angular errors of the smoothed estimates are much more stable and decrease more rapidly than those of the ‘raw’ estimates. The errors after 2·10^5 samples are below one degree. (It should be noted that this is extreme precision: with a resolution of 1 degree, a low estimate of the number of different orientations in a 1000-dimensional space is 10^2000.) The angular errors were calculated as the angle between the vectors and the exact solutions ê (known from the sample distribution of x and y), i.e.
    Err[ŵa] = arccos(ŵaT ê).

Figure 4.11: Top: The estimated first canonical correlation as a function of the number of iterations (solid line) and the true correlation in the current directions found by the algorithm (dotted line). The dimensionality of one set of variables is 800 and of the second set 200. Bottom: The logarithm of the angular error [log(rad)] as a function of the number of iterations.

Part II
Applications in computer vision

Chapter 5
Computer vision

This part of the dissertation shows how local linear adaptive models based on canonical correlation can be used in computer vision. This chapter serves as an introduction by giving a brief overview of the parts of the theory and terminology of computer vision relevant to the remaining chapters. For an extensive treatment of this subject, see (Granlund and Knutsson, 1995).

5.1 Feature hierarchies

An image in a computer is usually represented by an array of picture elements (pixels), each one containing a gray level value or a colour vector. The images referred to in this thesis are gray scale images. The pixel values can be seen as image features on the lowest level. On a higher level, there are for example the orientation and phase of one-dimensional events such as lines and edges. On the next level, the curvature describes the change of orientation. On still higher levels, there are features like shape, relations between objects, disparity et cetera. It is, of course, not obvious how to sort complex features into different levels. But, in general, it can be assumed that a function that estimates the values of a certain feature uses features of a lower level as input. High-level features are often estimated on a larger spatial scale than low-level features. Low-level features (e.g.
orientation) are usually estimated by using fairly simple combinations of linear filter outputs. These filter outputs are generated by convolving (see for example Bracewell, 1986) the image with a set of filter kernels. The filter coefficients can be described as a vector and so can each region of the image. Hence, for each position in the image, the filter output can be seen as a scalar product between a (reversed) filter vector and a signal vector.

Figure 5.1: The phase representation of a line/edge event.

5.2 Phase and quadrature filters

Consider a half period of a cosine wave. It can illustrate the cross section of a white¹ line if it is centred around 0 and a dark line if it is centred around π. If it is centred around π/2 or 3π/2, it can illustrate edges of opposite slopes. This leads to the concept of phase. To represent the kind of line/edge event in question, a phase angle θ can be used as illustrated in figure 5.1. If the line and edge filters are designed so that they are orthogonal, their outputs, ql and qe respectively, can be combined geometrically so that the magnitude

    |q| = √(ql² + qe²)                                              (5.1)

indicates the presence of a line or an edge of a certain orientation, and the argument

    θ = arctan(qe/ql)                                               (5.2)

represents the kind of event in question, i.e. the phase. A filter that fits this representation can be obtained as a complex filter consisting of a real-valued line filter and an imaginary edge filter:

    q = ql + i qe.                                                  (5.3)

¹White is here represented by the highest value and black is represented by the lowest value.

The magnitude is then the magnitude of the complex filter output q and the phase is the complex argument of q. If the magnitude is invariant with respect to the phase when applied to a pure sine wave function, the filter is said to be a quadrature filter. A quadrature filter has zero DC component and is zero in one half-plane in the frequency domain.
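A minimal numerical sketch of the representation (5.1)-(5.3), using a cosine-Gaussian and sine-Gaussian pair as an approximate quadrature pair; the filter size, centre frequency and envelope width are illustrative assumptions.

```python
import numpy as np

# Line/edge filter pair and the phase representation (5.1)-(5.3).
# The cosine-Gaussian / sine-Gaussian pair, its size, centre frequency
# and envelope width are illustrative assumptions.
n = np.arange(-8, 9)
w0 = np.pi / 3                        # assumed centre frequency
env = np.exp(-n ** 2 / 18.0)
f_line = env * np.cos(w0 * n)         # even, real part (line filter)
f_edge = env * np.sin(w0 * n)         # odd, imaginary part (edge filter)

def response(shift):
    """Complex output q = ql + i*qe for a shifted cosine pattern."""
    s = np.cos(w0 * (n + shift))
    return np.dot(s, f_line) + 1j * np.dot(s, f_edge)

shifts = np.linspace(0.0, 2 * np.pi / w0, 16)
mags = [abs(response(sh)) for sh in shifts]
ratio = max(mags) / min(mags)         # ~1 for an ideal quadrature pair
phase0 = np.angle(response(0.0))      # a centred bright line has phase 0
```

For a pure wave at the centre frequency the magnitude stays nearly constant as the pattern shifts, while the complex argument tracks the line/edge type, which is exactly the behaviour the quadrature property describes.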
An example of a quadrature filter is shown in figure 7.5 on page 130. It should be noted that the phase can only be defined after defining a direction in which to measure the phase.

5.3 Orientation

According to the assumption of local one-dimensionality (see page 52), it can be assumed that a small region of an image generally contains at most one dominant orientation. This orientation can be detected by using a set of at least three quadrature filters evenly spread out over all orientations (Knutsson, 1982). Here, the channel representation discussed in section 3.1 can be recognized. The orientation can be represented by a pure channel vector. If four filter orientations are used, the pure channel vector is

    q = ( |q1|, |q2|, |q3|, |q4| )T.                                (5.4)

By choosing a cos² shape with proper width of the filter functions, as described in section 3.1, the channel vector has a constant norm for all orientations. If four filter orientations are used, each channel looks like

    |qk| = d cos²(ϕk − φ),   ϕk = (k − 1) π/4,                      (5.5)

where ϕk is the filter orientation, φ is the line or edge orientation and d is an orientation invariant component. By using this set of channels, a more compact orientation vector can be composed:

    z = ( |q1| − |q3|, |q2| − |q4| )T.                              (5.6)

Inserting equation 5.5 into equation 5.6 gives

    z = a ( cos(2φ), sin(2φ) )T,                                    (5.7)

where a is an orientation invariant component. This orientation representation is called the double angle representation (Granlund, 1978).

Figure 5.2: The double angle representation.

The advantage of this representation can be seen when considering the rotation of a line. A line is identical if it is rotated 180°. Since z rotates 360° as a line rotates 180°, this gives an unambiguous and continuous representation of the orientation, as illustrated in figure 5.2. The norm and the orientation of the orientation vector z represent different independent features.
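The channel construction (5.4)-(5.7) can be sketched directly; here d = 1 and ideal cos² channels are illustrative assumptions.

```python
import numpy as np

# Channel vector (5.5) and double angle vector (5.6)-(5.7).
# d = 1 and ideal cos^2 channels are illustrative assumptions.
def channels(phi, d=1.0):
    phi_k = np.arange(4) * np.pi / 4          # filter orientations
    return d * np.cos(phi_k - phi) ** 2       # |q_k| of equation (5.5)

def double_angle(phi):
    q = channels(phi)
    return np.array([q[0] - q[2], q[1] - q[3]])   # equation (5.6)

z1 = double_angle(0.3)
z2 = double_angle(0.3 + np.pi)        # a line rotated 180 degrees
expected = np.array([np.cos(0.6), np.sin(0.6)])   # a = 1 in (5.7)
err = float(np.linalg.norm(z1 - expected))
norm_dev = abs(np.linalg.norm(double_angle(1.234)) - 1.0)
```

Rotating the input orientation by π leaves z unchanged, and the norm of z is constant over all orientations, illustrating both the unambiguity and the constant-norm property claimed in the text.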
While the argument of z represents the orientation of the signal, the norm depends on the energy of the signal in the passband of the filters. The double angle representation enables vector averaging of the orientation estimates. Vector averaging is usually performed to get a more robust orientation estimate in a larger region of the image. Vector averaging is a geometrical summation of the vectors followed by a normalization:

    v̄ = (1/n) ∑(i=1..n) vi.                                        (5.8)

The sum of inconsistently oriented vectors is shorter than the sum of vectors with similar directions. This means that the norm of the average vector can be interpreted as a kind of variance, or certainty, measure. This is an important difference between vector averaging and an ordinary scalar average. If the vector average is normalized using the average norm, i.e.

    v̄ = ∑(i=1..n) vi / ∑(i=1..n) ‖vi‖,                             (5.9)

the certainty measure lies between 0 and 1, where 1 means that all vectors have the same orientation.

5.4 Frequency

Since the norm of z depends on the frequency content of the signal, it can be used for estimating local (spatial) frequency. While frequency is only strictly defined for stationary signals, a condition that does not hold for most physical signals, the concept of instantaneous frequency (Carson and Fry, 1937; van der Pol, 1946) is usually defined as the rate of change of the phase of the analytical signal (see for example Bracewell, 1986; Granlund and Knutsson, 1995). The instantaneous frequency can be estimated using the ratio between the outputs of two lognormal quadrature filters (Knutsson, 1982). The radial function of a lognormal filter is defined in the frequency domain by

    Ri(f) = e^(−CB ln²(f/fi)),                                      (5.10)

where f = ‖u‖ is the norm of the frequency vector, fi is the centre frequency and CB = 4/(B² ln 2), where B is the 6 dB relative bandwidth. Function 5.10 is a Gaussian on a logarithmic scale.
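A small sketch of the lognormal radial function (5.10); the bandwidth and centre frequencies are illustrative, with the two centre frequencies one octave apart. The point is that the quotient of two such functions grows monotonically with frequency, which is what makes a quotient-based frequency estimate possible.

```python
import numpy as np

# Lognormal radial function (5.10). Bandwidth B and the centre
# frequencies (one octave apart) are illustrative assumptions.
def lognormal(f, fc, B=2.0):
    CB = 4.0 / (B ** 2 * np.log(2.0))
    return np.exp(-CB * np.log(f / fc) ** 2)

f = np.linspace(0.05, 1.0, 200)
R_lo = lognormal(f, fc=0.25)
R_hi = lognormal(f, fc=0.5)           # one octave above
quotient = R_hi / R_lo                # monotonic in f, cf. figure 5.3

peak = lognormal(0.25, 0.25)          # unity at the centre frequency
monotone = bool(np.all(np.diff(quotient) > 0))
```

On a logarithmic frequency axis the quotient of the two Gaussians is an exponential of a linear function of ln f, hence strictly monotonic, so it identifies the local frequency uniquely within the filters' passbands.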
The instantaneous frequency can now be estimated from the quotient

    ω̃i = |qi+1| / |qi|,                                            (5.11)

where qi = ‖qi‖ is the (orientation invariant) norm of the quadrature filter vector of centre frequency fi and the difference between fi and fi+1 is one octave (i.e. a factor of two). An example of this is illustrated in figure 5.3, where the frequency functions of two such lognormal filters are plotted (solid curves) together with the quotient in equation 5.11 (dashed line). See Granlund and Knutsson (1995) for further details. To estimate local frequencies in a wider range than that covered by the passbands of two filters, a weighted sum of instantaneous frequencies can be used:

    f̃ = ( ∑(i=0..N−1) |qi| )⁻¹ ∑(i=0..N−1) |qi+1| √(fi fi+1),      (5.12)

where fi+1 = 2 fi. The frequency can also be represented by a vector, as illustrated in figure 5.4. This enables vector averaging of the frequency estimates too.

Figure 5.3: The local frequency (dashed line) estimated as a quotient between the magnitudes of two lognormal quadrature filter outputs. The centre frequencies of the filters differ by one octave.

Figure 5.4: The vector representation of frequency.

5.5 Disparity

An important feature of binocular vision systems is disparity, which is a measure of the shift between two corresponding neighbourhoods in a pair of stereo images. The disparity is related to the angle the eyes (cameras) must be rotated relative to each other in order to focus on the same point in the 3-dimensional outside world. The corresponding process is known as vergence. The problem of estimating disparity between pairs of stereo images is not a new one (Barnard and Fischler, 1982). Early approaches often used matching of some feature in the two images (Marr, 1982).
The simplest way to calculate the disparity is to correlate a region in one image with all horizontally shifted regions at the same vertical position and then to find the shift that gave maximum correlation. This is, however, a computationally very expensive method. Since vergence implies a vision system acting in real time, other methods must be employed. Later approaches have focused more on using the phase information given by, for example, Gabor or quadrature filters (Sanger, 1988; Wilson and Knutsson, 1989; Jepson and Fleet, 1990; Westelius, 1995). An advantage of phase-based methods is that phase is a continuous variable that allows for sub-pixel accuracy. In phase-based methods, the disparity can be estimated as the ratio between the phase difference between corresponding vertical line/edge filter outputs from the two images and the instantaneous frequency:

    ∆x = ∆φ / φ′,                                                   (5.13)

where φ′ = ω is the instantaneous frequency. Phase-based stereo methods require the filters to be large enough to cover the same structure in the two images, i.e. the shift must be small compared to the wavelength of the filter. Otherwise, the phase difference ∆φ will not be related to the shift. On the other hand, if the shift is too small compared to the wavelength of the filter, the resolution becomes poor, which leads to a bad disparity estimate. The disparity algorithm proposed by Wilson and Knutsson (1989) handles this problem by working on a scale pyramid of different image resolutions. It starts by estimating the disparity on a coarse scale, which corresponds to using low-frequency filters, and adjusting the cameras to minimize this disparity. This process is then iterated on consecutively finer scales. A problem that is not solved by that approach occurs when the observed surface is tilted in depth so that the depth varies along the horizontal axis. In this situation, the surface will be viewed at different scales by the two cameras, as illustrated in figure 5.5.
This means that phase information on one scale in the left image must be compared with phase information on another scale in the right image. In most stereo algorithms, this problem cannot be handled in a simple way. Another problem that most stereo algorithms are faced with occurs at vertical depth discontinuities (but see Becker and Hinton (1993)). Around the discontinuity there is a region where the algorithm either will not be able to make an estimate at all, or the estimate will be some average between the two correct disparities, indicating a slope rather than a step.

Figure 5.5: Scaling effect when viewing a tilted plane.

Chapter 6
Learning feature descriptors

This chapter shows how canonical correlation analysis can be used to find models that represent local features in images. Such models can be seen as filters that describe a particular feature in an image. The filters can also be forced to be invariant with respect to certain other features. The features to be described are learned by giving the algorithm examples that are presented in pairs. The pairs are arranged in such a way that the property of a certain feature, for example the orientation of a line, is equal for each pair, while other properties, for example phase, are presented in an unordered way. This method was presented at SCIA’97 (Borga et al., 1997a). The idea behind this approach is to use CCA to analyse two signals where the common signal components are due to the feature that is to be represented, as illustrated in figure 6.1. The signal vectors fed into the CCA are image data mapped through some function f. If f is the identity operator (or any other full-rank linear function), the CCA finds the linear combinations of pixel data that have the highest correlation. In this case, the canonical correlation vectors can be seen as linear filters.
In general, f can be any vector-valued function of the image data, or even different functions fx and fy, one for each signal space. The choice of f can be seen as the choice of representation of input data for the canonical correlation analysis. As discussed in chapter 3, the choice of representation is very important for the ability to learn. The canonical correlation vectors wx and wy together with the functions fx and fy can be seen as filters. The filters that are developed in this way have the property of maximum correlation between their outputs when applied to two image patches where the represented feature varies simultaneously. In other words, the filters maximize the signal to noise ratio between the desired feature and other signals (see section 4.4.2 on page 70).

Figure 6.1: A symbolic illustration of the method of using CCA for finding feature detectors in images. The desired feature (here illustrated by a solid line) is varying equally in both image sequences while other features (here illustrated with dotted curves) vary in an uncorrelated way. The input to the CCA is a function f of the image.

A more general approach is to try to maximize mutual information instead of canonical correlation. This could be accomplished by changing not only the linear projection in the CCA, but also the functions fx and fy until the maximum correlation ρ is found. This approach relies on the relation between canonical correlation and mutual information discussed in section 4.4.1. The maximum mutual information approach is illustrated in figure 6.2. If fx and fy are parameterized functions, the parameters can be updated in order to maximize ρ. This is related to the work of Becker and Hinton (1992), where f was implemented as neural networks with a single neuron in the output layers. The cost function in their approach was the quotient between the variance of the sum and the variance of the difference of the network outputs.
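For reference, the canonical correlations of a sample set can also be computed in batch form by whitening each signal space and taking an SVD of the cross-covariance. This minimal sketch (not the iterative algorithm of chapter 4) uses toy data with one shared component whose true first canonical correlation is 0.5.

```python
import numpy as np

# Batch CCA via whitening + SVD. Toy data: one shared unit-variance
# component in unit-variance noise on each side, so the true first
# canonical correlation is 1/sqrt(2*2) = 0.5.
rng = np.random.default_rng(1)
N = 20000
s = rng.normal(size=N)
X = rng.normal(size=(N, 3))
Y = rng.normal(size=(N, 2))
X[:, 0] += s
Y[:, 0] += s

def canonical_correlations(X, Y):
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    Cxx = X.T @ X / len(X)
    Cyy = Y.T @ Y / len(Y)
    Cxy = X.T @ Y / len(X)
    def inv_sqrt(C):
        d, E = np.linalg.eigh(C)          # C is symmetric positive definite
        return E @ np.diag(d ** -0.5) @ E.T
    K = inv_sqrt(Cxx) @ Cxy @ inv_sqrt(Cyy)
    return np.linalg.svd(K, compute_uv=False)

rho = canonical_correlations(X, Y)        # descending order
```

The singular values of the whitened cross-covariance are the canonical correlations; with 20,000 samples the leading one lands close to the theoretical 0.5 and the second stays near the random-correlation level.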
The approach illustrated in figure 6.2 allows fx and fy to be implemented as neural networks with several units in their output layers. In the work presented here, however, the functions are fixed and identical for x and y. In this chapter, f is the outer product of pixel data or the outer product of quadrature filter outputs. But projection operators can also be useful, as will be seen in chapter 7. The motive for choosing non-linear functions here is that we want to find feature descriptors with useful invariance properties. Of course, a linear filter is also invariant to several changes of the signal. It is, for example, easy to design a linear filter that is invariant with respect to the mean intensity of the image. But higher-order functions can have more interesting invariance properties, as discussed by Giles and Maxwell (1987) and Nordberg et al. (1994).

Figure 6.2: A general approach for finding maximum mutual information.

To see this, consider the output q of a linear filter f for a signal s in one point: q = sT f. The invariance of this filter can be defined as

    dq = dsT f = 0.                                                 (6.1)

This means that the changes ds of the signal for which the linear filter is invariant must be orthogonal to the filter. Since the invariance properties of linear filters are very limited, it is natural to try second-order functions, which means that f is an outer product of the pixel data. For a quadratic function F, the output can be written as q = sT F s. Here, the invariance is defined by

    dq = 2 dsT F s = 0.                                             (6.2)

This expression can, for example, include the invariances of the linear case if F = f fT. But the quadratic filter can also have invariance properties that depend on the signal s and not only on the change ds as in the linear case. An example illustrating the differences between the invariances of linear and quadratic functions is shown in figure 6.3.
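A numerical sketch of the signal-dependent invariance of a quadratic form: with F = f fT + g gT for an approximate quadrature pair (f, g), the output sT F s is nearly invariant to shifting a sine-wave pattern. The filter parameters are illustrative assumptions.

```python
import numpy as np

# Quadratic form F = f f^T + g g^T for an approximate quadrature pair
# (f, g): s^T F s is nearly shift invariant on a sine-wave pattern.
# Filter size, centre frequency and envelope are illustrative.
n = np.arange(-8, 9)
w0 = np.pi / 3
env = np.exp(-n ** 2 / 18.0)
f = env * np.cos(w0 * n)              # even (line) filter
g = env * np.sin(w0 * n)              # odd (edge) filter
F = np.outer(f, f) + np.outer(g, g)   # quadratic function, q = s^T F s

outs = []
for shift in np.linspace(0.0, 2 * np.pi / w0, 16):
    s = np.cos(w0 * (n + shift))      # shifted sine-wave pattern
    outs.append(float(s @ F @ s))
variation = max(outs) / min(outs)     # ~1: (phase) shift invariance
```

No single linear filter can be invariant to this family of changes, since the change ds caused by a shift is not orthogonal to any fixed direction; the quadratic form achieves it because its invariant directions depend on the signal s itself.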
In the linear case, the invariances define lines in the two-dimensional case (hyper-planes in general). The lines are orthogonal to f. In the quadratic case, the invariances can define, for example, hyperbolic or parabolic surfaces or ellipsoids. One example of interesting invariance properties of second-order functions is shift, or phase, invariance when the filter is applied to a sine wave pattern. This is the case for the norm of the output from a pair of quadrature filters, which is a quadratic function of the pixel data.

Figure 6.3: Examples of invariances for linear (left, dsT f = 0) and quadratic (right, dsT F s = 0) two-dimensional functions. The lines are iso-curves on which the function is constant. A change of the parameter vector s along the lines will not change the output.

6.1 Experiments

If f is an outer product and the image pairs contain sine wave patterns with equal orientations but different phase, the CCA should find a linear combination of the outer products that is sensitive with respect to orientation and invariant with respect to phase. As illustrated in the experiments below, this is also what happens. The outer products weighted by the canonical correlation vectors can be interpreted as outer products of linear filters. As shown in the experiment, these linear filters are approximately quadrature filters, which explains the phase invariance of the product. The finding of quadrature filters in the interpretation of the result of the CCA can serve as a motive for trying products of quadrature filter outputs as input to CCA on a higher level. To simplify the description, two functions are used to reshape a matrix into a vector and the other way around: vec(M) transforms (flattens) an m × n matrix M into a vector with mn components (see definition A.1) and mtx(v, m, n) reshapes the vector v into an m × n matrix (see definition A.2).
In particular, for an m × n matrix M,

    mtx(vec(M), m, n) = M.                                          (6.3)

6.1.1 Learning quadrature filters

The first experiment shows that quadrature filters are found by the method discussed above when products of pixel data are presented to the algorithm.

Figure 6.4: Illustration of the generation of input data vectors x and y as outer products of pixel data. See the text for a detailed explanation.

Let Ix and Iy be a pair of 5 × 5 image patches. Each image consists of a sine wave pattern with a frequency of 2π/5 and additive Gaussian noise. A sequence of such image pairs is constructed so that, for each pair, the orientation is equal in the two images while the phase differs in a random way. The images have independent noise. Each image pair is described by vectors ix = vec(Ix) and iy = vec(Iy). Let x and y be vectors describing the outer products of the image vectors, i.e. x = vec(ix ixT) and y = vec(iy iyT). This gives a sequence of pairs of 625-dimensional vectors describing the products of pixel data from the images. This scheme is illustrated in figure 6.4. The sequence consists of 6,500 examples, i.e. 20 examples per degree of freedom. (The outer product matrices are symmetric and, hence, the number of free parameters is (n² + n)/2, where n is the dimensionality of the image vector.) For a signal to noise ratio (SNR) of 0 dB, there were 6 significant¹ canonical correlations and for an SNR of 10 dB there were 8 significant canonical correlations. The canonical correlations are plotted to the left in figure 6.5.

Figure 6.5: Left: The 20 largest canonical correlations for the 10 dB SNR sequence. Right: Projections of outer product vectors x onto the 8 first canonical correlation vectors (wx1 to wx8), as functions of orientation between 0 and π.

The two most significant correlations for the 0 dB case were both 0.7, which corresponds to an SNR² of 3.7 dB. For the 10 dB case, the two highest correlations were both 0.989, corresponding to an SNR of 19.5 dB. The projections of image signals x for orientations between 0 and π onto the 8 first canonical correlation vectors wx from the 10 dB case are shown to the right in figure 6.5. The test signals were generated with random phase and without noise. As seen in the figure, the filters defined by the first two canonical correlation vectors are sensitive to the double angle of the orientation of the signal and invariant with respect to phase. The two curves are 90° out of phase and, hence, generate a double angle representation (see figure 5.2 on page 102). The following curves show the projections onto the successive canonical correlation vectors with lower canonical correlations. The filters defined by these vectors are sensitive to the fourth, sixth and eighth multiples of the orientation.

¹By significant, we mean that they differ from the random correlations caused by the limited set of samples. The random correlation, in the case of 20 samples per degree of freedom, is approximately 0.4 (given by experiments).

²The relation between correlation and SNR in this case is defined by the correlation between two signals with the same SNR, i.e. corr(s + η1, s + η2). (See section 4.4.2.)

Interpretation of the result

It is not easy to interpret the 625 coefficients in each canonical correlation vector. But since the data actually were generated as outer products, i.e. 25 × 25 matrices, the interpretation of the resulting canonical correlation vectors can be facilitated by writing them as 25 × 25 matrices, Wx = mtx(wx, 25, 25). This means that the projection of x onto a canonical correlation vector wx can be written as

    xT wx = ixT Wx ix,                                              (6.4)

where ix is the pixel data vector.
By an eigenvalue decomposition of Wx, this projection can be written as

    xT wx = ixT ( ∑j λj ej ejT ) ix = ∑j λj (ixT ej)²,              (6.5)

i.e. a square sum of the pixel data vector projected onto the eigenvectors of Wx, weighted with the corresponding eigenvalues. This means that the eigenvectors ej can be seen as linear filters, and the curves plotted to the right in figure 6.5 are weighted square sums of the pixel data vectors projected onto the eigenvectors of the matrices Wxi. It turns out that only a few of the eigenvectors give significant contributions to the projection sum in equation 6.5. This can be seen if the terms in the sum are averaged over all orientations of the signal:

    mj = E[ λj (ixT ej)² ].                                         (6.6)

The coefficients mj measure the average energy picked up by the corresponding eigenvectors and can therefore be seen as significance measures for the different eigenvectors. In figure 6.6, the significance measures mj for the 25 eigenvectors are plotted for the two first canonical correlation vectors wx1 and wx2. Since the projections of x onto the canonical correlation vectors wx can be described in terms of projections of pixel data ix onto a few 25-dimensional eigenvectors ej, these eigenvectors can be used to interpret the canonical correlation vectors. Since the image data ix are collected from 5 × 5 neighbourhoods Ix, it is logical to view also the eigenvectors ej as 5 × 5 matrices Ej. These matrices can be called eigenimages. The process of extracting eigenimages from a canonical correlation vector is illustrated in figure 6.7. In figure 6.8, the four most significant eigenimages are shown for the first (top) and second (bottom) canonical correlations respectively. The eigenimages can be interpreted as quadrature filter pairs, i.e. filter pairs that have the same spectrum and differ 90° in phase (see section 5.2).
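The eigenimage extraction of figure 6.7 and the identity (6.5) can be sketched as follows; a random symmetric matrix stands in here for an actual reshaped canonical correlation vector.

```python
import numpy as np

# Eigenimage extraction (figure 6.7) and the identity (6.5).
# A random symmetric 25x25 matrix stands in for the reshaped canonical
# correlation vector Wx = mtx(wx, 25, 25).
rng = np.random.default_rng(2)
A = rng.normal(size=(25, 25))
Wx = (A + A.T) / 2                    # outer-product data makes Wx symmetric
wx = Wx.flatten()                     # vec(Wx), 625-dimensional

lam, E = np.linalg.eigh(Wx)           # eigenvalues and eigenvectors
eigenimages = [E[:, j].reshape(5, 5) for j in range(25)]  # 5x5 images

# check the identity (6.5): x^T wx = sum_j lam_j (ix^T e_j)^2
ix = rng.normal(size=25)              # a pixel data vector
x = np.outer(ix, ix).flatten()        # x = vec(ix ix^T)
lhs = float(x @ wx)
rhs = float(sum(lam[j] * (ix @ E[:, j]) ** 2 for j in range(25)))
```

The identity holds because vec(ix ixT)T vec(Wx) = ixT Wx ix, and the eigen-decomposition of the symmetric Wx turns that quadratic form into a weighted square sum over the eigenimages.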
Figure 6.6: The significance measures mj for the 25 eigenvectors for the two first canonical correlation vectors wx1 (top) and wx2 (bottom).

Figure 6.7: Illustration of the extraction of 5 × 5 eigenimages Ej from a 625-dimensional canonical correlation vector wx.

Figure 6.8: The four most significant eigenimages for the first (top row) and second (bottom row) canonical correlations respectively.

For wx1, eigenimages E15 and E16 form a quadrature pair in one direction and eigenimages E17 and E18 form a quadrature pair in the perpendicular direction. The same interpretation can be made for wx2. To see more clearly that this interpretation is correct, the eigenimage pairs can be combined in the same way as complex quadrature filters, i.e. as one real filter and one imaginary filter with a phase difference of 90°, by multiplying one of the filters³ with i (see section 5.2, page 101). The spectra of the combinations E15 + iE16 and E17 + iE18 for wx1 are shown in the upper row in figure 6.9. In the lower row, the spectra of the combinations E12 + iE14 and E13 + iE16 for wx2 are shown. The DC component is in the centre of the spectrum. The white circle illustrates the centre frequency of the training signal. The blobs in the figure show that these eight eigenvectors can be interpreted as four quadrature filter pairs in four different directions.

6.1.2 Combining products of filter outputs

In this experiment, outputs from neighbouring sets of quadrature filters are used rather than pixel values as input to the algorithm.
The experimental result shows that canonical correlation can find a way of combining filter outputs from a local neighbourhood to get orientation estimates that are less sensitive to noise than the vector averaging method (see section 5.3 on page 102).

³Usually, the real filter is symmetric and the imaginary filter is anti-symmetric. However, the choice of offset phase does not matter as long as the filters differ 90° in phase.

Figure 6.9: Spectra for the eigenimages interpreted as complex quadrature filter pairs: |F(E15 + iE16)|² and |F(E17 + iE18)|² for wx1 (upper row); |F(E12 + iE14)|² and |F(E13 + iE16)|² for wx2 (lower row).

Let qxi and qyi, i ∈ {1, ..., 25}, be 4-dimensional complex vectors of filter responses from four quadrature filters at each of 25 different positions in a 5 × 5 neighbourhood. The quadrature filters used here have kernels of 7 × 7 pixels, a centre frequency of π/(2√2) and a bandwidth of two octaves. Let Xi = qxi qxi* and Yi = qyi qyi* be the outer products of the filter responses in each position for each image. Finally, all products are gathered into two 400-dimensional vectors:

    x = ( vec(X1)T, vec(X2)T, ..., vec(X25)T )T   and   y = ( vec(Y1)T, vec(Y2)T, ..., vec(Y25)T )T.    (6.7)

This scheme is illustrated in figure 6.10.

Figure 6.10: Illustration of the generation of input data vectors x and y as outer products of quadrature filter response vectors from 5 × 5 neighbourhoods.

Figure 6.11: Angular errors for 1,000 different samples using canonical correlations (top) and vector averaging (bottom).

8,000 pairs of vectors were generated.
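The input construction (6.7) can be sketched as follows, with random complex numbers standing in for actual quadrature filter responses; a conjugate outer product is assumed here.

```python
import numpy as np

# Input construction (6.7): 4-dimensional complex filter responses at
# 25 positions, one outer product per position, all flattened into a
# 400-dimensional vector. Random numbers stand in for filter outputs,
# and a conjugate outer product is assumed.
rng = np.random.default_rng(3)
q_x = [rng.normal(size=4) + 1j * rng.normal(size=4) for _ in range(25)]

X_i = [np.outer(q, np.conj(q)) for q in q_x]    # 4x4 outer products
x = np.concatenate([X.flatten() for X in X_i])  # the vector of (6.7)

dim = x.shape[0]
hermitian = all(np.allclose(X, X.conj().T) for X in X_i)
```

Each per-position product is Hermitian, and 25 positions times 16 product coefficients give the 400 dimensions stated in the text.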
For each pair of vectors, the local orientation was equal while the phase and noise differed randomly. Gaussian noise was added to the images, giving an SNR of 0 dB. The data set was analysed using CCA. The two largest canonical correlations were both 0.85. The corresponding vectors detected the double angle of the orientation, invariant with respect to phase. New test data were generated using a rotating sine-wave pattern with an SNR of 0 dB and projected onto the first two canonical correlation vectors. The angular error is shown in the upper plot in figure 6.11. (The mean angular error is not relevant since it only depends on a reference orientation. The reference orientation can be chosen arbitrarily and, hence, it has been chosen so that the mean angular error is zero.) The lower plot shows the angular error using vector averaging on the same data. The standard deviation of the angular error was 9.4° with the CCA method and 14.8° using vector averaging. This is an improvement of the SNR by 4 dB compared to the result when using vector averaging on the same neighbourhood size.

6.2 Discussion

In this chapter, it has been shown how a system can learn image feature descriptors by using canonical correlation. A nice property of the method is that the training is done by giving examples of what the user defines as being "equal". In the experiments, sine-wave patterns were considered to be "equal" if they had the same orientation, irrespective of the phase. This was presented to the system as a set of examples, and the user did not have to figure out how to represent orientation and phase. In the first experiment, the system developed a phase-invariant double-angle orientation representation. This type of learning is of course more useful for higher-level feature descriptors, where it can be difficult to define proper representations of features. Examples of such features are corners and line crossings.
In the next chapter, another application of this method is presented, namely disparity estimation, where the horizontal displacement between the images is equal within the training set.

Chapter 7
Disparity estimation using CCA

An important problem in computer vision that is suitable to handle with CCA is stereo vision, since data in this case naturally appear in pairs. In this chapter, a novel stereo vision algorithm that combines CCA and phase analysis is presented. The algorithm has been presented in a paper submitted to ICIPS'98 (Borga and Knutsson, 1998). For a learning system, the stereo problem is difficult to solve: for small disparities, the high-frequency filters will give the highest accuracy, while for large disparities, the high-frequency filters will be uncorrelated with the disparity and only the low-frequency filters can be used. So the choice of which filters to use for the disparity estimate must itself be based on a disparity estimate! Furthermore, a general learning system cannot be supposed to know beforehand which inputs come from a certain scale. A solution to this problem is to let the system adapt filters to fit the disparity in question instead of using fixed filters. The algorithm described here consists of two parts: CCA and phase analysis. Both are performed for each disparity estimate. Canonical correlation analysis is used to create adaptive linear combinations of quadrature filters. These linear combinations are new quadrature filters that are adapted in frequency response and spatial position in order to maximize the correlation between the filter outputs from the two images. These new filters are then analysed in the phase analysis part of the algorithm. The coefficients given by the canonical correlation vectors are used as weighting coefficients in a pre-computed table that allows for an efficient phase-based search for disparity.
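The CCA step at the core of the algorithm can be written compactly from sample covariance matrices. The sketch below uses the standard CCA eigenproblem formulation (equation 4.26 in the thesis is an equivalent form, not reproduced here); `cca_first` and the regularizing `eps` are illustrative choices:

```python
import numpy as np

def cca_first(X, Y, eps=1e-9):
    """First canonical correlation and x-side vector for zero-mean
    samples X, Y (rows = samples). Standard CCA eigenproblem:
    inv(Cxx) Cxy inv(Cyy) Cyx wx = rho^2 wx."""
    n = len(X)
    Cxx = X.conj().T @ X / n
    Cyy = Y.conj().T @ Y / n
    Cxy = X.conj().T @ Y / n
    M = (np.linalg.solve(Cxx + eps * np.eye(len(Cxx)), Cxy)
         @ np.linalg.solve(Cyy + eps * np.eye(len(Cyy)), Cxy.conj().T))
    vals, vecs = np.linalg.eig(M)
    k = np.argmax(vals.real)
    rho = np.sqrt(max(vals[k].real, 0.0))
    return rho, vecs[:, k]

# Toy data: the first components share a common source
rng = np.random.default_rng(1)
s = rng.standard_normal(1000)
X = np.c_[s, rng.standard_normal(1000)]
Y = np.c_[s + 0.1 * rng.standard_normal(1000), rng.standard_normal(1000)]
rho, wx = cca_first(X, Y)
print(rho)
```

With the shared component above, the first canonical correlation comes out close to one, as expected.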
(This problem is similar to the problem a system faces when learning to interpret numbers that are represented by one digit on each input. Only the most significant digit will have any correlation with the correct number, but the correlation will be weak due to the coarse quantization. Only after this digit is identified is it possible to detect the use of the next digit.)

In the following two sections, the two parts of the stereo algorithm are described in more detail. In section 7.3, some experiments are presented to illustrate the performance of the proposed method. Finally, the method is discussed in section 7.4.

7.1 The canonical correlation analysis part

The input x and y to the CCA come from the left and right images respectively. Each input is a vector with outputs from a set of quadrature filters:

  x = (qx1, ..., qxN)^T   and   y = (qy1, ..., qyN)^T,   (7.1)

where qi is the (complex) filter output for the ith quadrature filter in the filter set. The quadrature filters can be seen as the functions f in figure 6.1 on page 108. In this case, f is a complex vector-valued linear function, i.e. a complex matrix. In the implementation described here, the filter set consists of two identical one-dimensional (horizontal) quadrature filters with two pixels relative displacement. (Other and larger sets of filters can be used, including, for example, filters with different bandwidths, different centre frequencies, different positions, etc.) The data are sampled from a neighbourhood N around the point of the disparity estimate. The choice of neighbourhood size is a compromise between noise sensitivity and locality. The covariance matrix C is calculated using the vectors x and y in N.
The fact that quadrature filters have zero mean simplifies this calculation to an outer-product sum:

  C = Σ_{j∈N} (xj; yj)(xj; yj)*.   (7.2)

If a rectangular neighbourhood is used, this calculation can be made efficient by a Cartesian separable summation of the outer products, as illustrated in figure 7.1. First, the outer products are summed in a window moving horizontally along each row. Then this result is summed again by a window moving vertically along each column. This scheme requires 2·m·n additions and subtractions of outer products, where m × n is the size of the image (except for the borders that are not reached by the centre of the neighbourhood). This can be compared to a straightforward summation over each neighbourhood, which requires N·m·n additions of outer products, where N is the size of the neighbourhood. Hence, for a neighbourhood of 10×10, the separable summation is 50 times faster than a straightforward summation over each neighbourhood.

Figure 7.1: Cartesian separable summation of the outer products.

The first canonical correlation ρ1 and the corresponding (complex) vectors wx and wy are then calculated. If the set of filters is small, this is done by solving equation 4.26 on page 68. In the case where only two filters are used, this calculation is very simple. If very large sets of filters are used, an analytical calculation of the canonical correlation becomes computationally very expensive. In such a case, the iterative algorithm presented in section 4.7.3 can be used. The canonical correlation vectors define two new filters:

  fx = Σ_{i=1}^{M} wxi fi   and   fy = Σ_{i=1}^{M} wyi fi,   (7.3)

where fi are the basis filters, M is the number of filters in the filter set, and wxi and wyi are the components of the first pair of canonical correlation vectors. Due to the properties of canonical correlation, the new filters fx and fy have outputs with maximum correlation over N, given the set of basis filters fi.
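The Cartesian separable summation of figure 7.1 is a row-wise sliding-window sum followed by a column-wise one. A minimal sketch, using a scalar quantity per pixel for brevity (in the algorithm each entry would be an outer product):

```python
import numpy as np

def neighbourhood_sums(P, n):
    """Sliding n x n window sums of a per-pixel quantity P, computed
    separably (figure 7.1): running sums along rows, then along columns.
    Cost is ~2 additions/subtractions per pixel instead of n*n."""
    rows = np.cumsum(P, axis=1)
    rows = np.concatenate([rows[:, n-1:n], rows[:, n:] - rows[:, :-n]], axis=1)
    cols = np.cumsum(rows, axis=0)
    return np.concatenate([cols[n-1:n, :], cols[n:, :] - cols[:-n, :]], axis=0)

rng = np.random.default_rng(0)
P = rng.standard_normal((8, 8))
S = neighbourhood_sums(P, 3)
# Agrees with a direct summation over one neighbourhood
print(np.isclose(S[0, 0], P[:3, :3].sum()))  # True
```

For complex outer products the same code applies elementwise, since the running sums act independently on each matrix entry.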
7.2 The phase analysis part

The key idea of this part is to search for the disparity that corresponds to a real-valued correlation between the two new filters. This idea is based on the fact that canonical correlations are real valued (see proof B.4.1 on page 165). In other words, find the disparity δ such that

  Im[Corr(qy(ξ + δ), qx(ξ))] = Im[c(δ)] = 0,   (7.4)

where qx and qy are the left and right filter outputs respectively and ξ is the spatial (horizontal) coordinate. There does not seem to exist a well-established definition of correlation for complex variables. The definition used here (see definition A.3) is a generalization of correlation for real-valued variables, similar to the definition of covariance for complex variables. A calculation of the correlation over N for all δ would be very expensive. A much more efficient solution is to assume that the signal s can be described by a covariance matrix Css. Under this assumption, the correlation between the left filter convolved with the signal s and the right filter convolved with the same signal shifted a certain amount δ can be measured. But convolving a filter with a shifted signal is the same as convolving a shifted filter with the non-shifted signal. Hence, the correlation c(δ) can be calculated as the correlation between the left filter convolved with s and a shifted version of the right filter convolved with the same signal s. Under the assumption that the signal s has the covariance matrix Css, the correlation in equation 7.4 can be written as

  c(δ) = E[qx* qy(δ)] / √(E[|qx|²] E[|qy|²])
       = E[(s ∗ fx)* (s ∗ fy(δ))] / √(E[(s ∗ fx)* (s ∗ fx)] · E[(s ∗ fy(δ))* (s ∗ fy(δ))])
       = E[fx* s s* fy(δ)] / √(E[fx* s s* fx] · E[fy(δ)* s s* fy(δ)])
       = fx* Css fy(δ) / √((fx* Css fx)(fy* Css fy)),   (7.5)

where fy(δ) is a shifted version of fy. Remember that the quadrature filter outputs have zero mean, which is necessary for the first equality.
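The complex correlation of definition A.3, used in equation 7.4, can be sketched directly; `complex_corr` is an illustrative name:

```python
import numpy as np

def complex_corr(x, y):
    """Correlation for zero-mean complex variables: E[x^* y] normalized
    by the root of the second moments (generalizes the real case the
    same way complex covariance does)."""
    return (x.conj() * y).mean() / np.sqrt(
        (np.abs(x) ** 2).mean() * (np.abs(y) ** 2).mean())

rng = np.random.default_rng(0)
x = rng.standard_normal(10000) + 1j * rng.standard_normal(10000)
# A pure phase shift of the signal rotates the correlation: |c| = 1, arg = pi/2
print(np.isclose(complex_corr(x, 1j * x), 1j))  # True
```

This is exactly the quantity whose imaginary part the phase analysis drives to zero as a function of δ.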
Note the similarity between the last expression and the expression for canonical correlation in equation 4.24 on page 68. Many of the computations needed to calculate c(δ) can be saved, since

  fx* Css fy(δ) = (Σ_{i=1}^{M} wxi fi)* Css (Σ_{j=1}^{M} wyj fj(δ))
                = Σ_i Σ_j wxi* wyj fi* Css fj(δ)
                = Σ_{ij} vij gij(δ),   (7.6)

where

  gij(δ) = fi* Css fj(δ).   (7.7)

The function gij(δ) does not depend on the result from the CCA and can therefore be calculated in advance for different disparities δ and stored in a table. The denominator in equation 7.5 can be treated in the same way but does not depend on δ:

  fx* Css fx = Σ_{ij} vxij gij(0)   and   fy* Css fy = Σ_{ij} vyij gij(0),   (7.8)

where vxij = wxi* wxj and vyij = wyi* wyj. Note that the filter vectors f must be padded with zeros at both ends to enable the scalar product between a filter and a filter shifted by δ. (The zeros do not, of course, affect the result of equation 7.6.) In the case of two basis filters, the table contains four rows and eight constants. Hence, for a given disparity, a (complex) correlation c(δ) can be computed as a normalized weighted sum:

  c(δ) = Σ_{ij} vij gij(δ) / √(Σ_{ij} vxij gij(0) · Σ_{ij} vyij gij(0)).   (7.9)

The aim is to find the δ for which the correlation c(δ) is real valued. This is done by finding the zero crossings of the phase of the correlation. A very coarse quantization of δ can be used in the table, since the phase is, in general, rather linear near the zero crossing (as opposed to the imaginary part, which in general is not linear). Hence, first a coarse estimate of the zero crossing is obtained. Then the derivative of the phase at the zero crossing is measured, using two neighbouring samples.
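Equations 7.6–7.9 can be sketched for the white-noise signal model of section 7.2.1, for which gij(δ) reduces to a scalar product fi* fj(δ). The random basis filters below are stand-ins; `shift` and `c_of_delta` are illustrative names:

```python
import numpy as np

def shift(f, d):
    """f shifted d samples with zero padding (the zeros do not affect eq. 7.6)."""
    out = np.zeros_like(f)
    if d >= 0:
        out[d:] = f[:len(f) - d]
    else:
        out[:d] = f[-d:]
    return out

rng = np.random.default_rng(0)
M, L = 2, 16
f = rng.standard_normal((M, L)) + 1j * rng.standard_normal((M, L))
deltas = np.arange(-4, 5)
# Pre-computed table, white-noise model Css = I: g_ij(d) = f_i^* . f_j(d)
g = np.array([[[f[i].conj() @ shift(f[j], d) for d in deltas]
               for j in range(M)] for i in range(M)])
g0 = g[:, :, 4]                                    # the delta = 0 column

def c_of_delta(wx, wy):
    """Normalized weighted table sum, equation 7.9."""
    v = np.outer(wx.conj(), wy)                    # v_ij  = wxi^* wyj
    vx = np.outer(wx.conj(), wx)                   # vx_ij = wxi^* wxj
    vy = np.outer(wy.conj(), wy)
    num = np.einsum('ij,ijd->d', v, g)
    den = np.sqrt((vx * g0).sum().real * (vy * g0).sum().real)
    return num / den

c = c_of_delta(np.array([1.0, 0.5j]), np.array([0.5, 1.0 + 0j]))
print(np.all(np.abs(c) <= 1 + 1e-9))  # True, by the Cauchy-Schwarz inequality
```

Once the table g is stored, each disparity estimate costs only the small weighted sums above, independently of the filter length.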
Finally, the error in the coarse estimate is compensated for by using the actual phase value and the phase derivative at the estimated position:

  δ = δc − ϕ(δc) / (∂ϕ/∂δ),   (7.10)

where δc is the coarse estimate of the zero crossing and ϕ(δc) is the complex phase of c(δc) (see figure 7.2).

Figure 7.2: The estimation of the coordinate δ0 of the phase zero crossing using the coarse estimate δc of the zero crossing, the phase value ϕ(δc) and the derivative at the coarse estimate. The black dots illustrate the sampling points of the phase given by the table gij(δ).

7.2.1 The signal model

If the signal model is uncorrelated white noise, Css is the identity matrix and the calculation of the values in the table reduces to a simple scalar product: gij(δ) = fi* fj(δ). There is no computational reason to choose white noise as the signal model if there is a better model, since the table is calculated only once. But it can still be interesting to compare the correlation for white noise with the correlation for another signal model, in order to get a feeling for the algorithm's sensitivity with respect to the signal model. In other words, how does the choice of model affect the position of the zero phase of c(δ)? First of all, it should be noted that the denominator in equation 7.5 is real valued and, hence, does not affect the complex phase of c(δ). So only the numerator

  c′(δ) = fx* Css fy(δ)   (7.11)

has to be considered. In general, Css is a Toeplitz matrix (i.e. Cij = C(i − j)) with the columns (and rows) containing shifted versions of the (non-normalized) autocorrelation function cs of the signal s. This means that f̃x* = fx* Css can be seen as a convolution of fx with the autocorrelation function cs. But

  c′(δ) = f̃x* fy(δ)   (7.12)

can be seen as a convolution between f̃x and f̄y, where f̄y is fy reversed, since δ only causes a shift of fy.
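The refinement in equation 7.10, with the phase derivative estimated from the two neighbouring table samples, can be sketched on a synthetic linear phase; `refine_zero_crossing` is an illustrative name:

```python
import numpy as np

def refine_zero_crossing(deltas, phase, k):
    """Refine a coarse zero crossing at deltas[k] using the local phase
    value and a two-point derivative estimate (equation 7.10)."""
    dphi = (phase[k + 1] - phase[k - 1]) / (deltas[k + 1] - deltas[k - 1])
    return deltas[k] - phase[k] / dphi

# A linear phase crossing zero at delta = 1.3, sampled on a coarse grid
deltas = np.arange(-4.0, 5.0)
phase = 0.5 * (deltas - 1.3)
k = int(np.argmin(np.abs(phase)))   # coarse estimate: nearest grid point
delta = refine_zero_crossing(deltas, phase, k)
print(delta)  # ~ 1.3
```

Because the phase is close to linear near the zero crossing, one Newton-like step from the coarse grid point recovers the crossing to sub-sample accuracy.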
This means that c′(δ) can be written as

  c′(δ) = (fx ∗ cs)* f̄y,   (7.13)

where ∗ denotes convolution. Since the order of convolutions does not matter (convolution is commutative and associative), c′(δ) can be written as

  c′(δ) = (fx ∗ f̄y) ∗ cs = (fx* fy(δ)) ∗ cs,   (7.14)

i.e. the convolution between fx and f̄y can be calculated first. This function can then be convolved with the autocorrelation function to get the correct c′ for the model. This means that the difference between the correlation c(δ) calculated for white noise (i.e. Css is the identity matrix) and the correlation calculated using another signal model is given by the convolution of c(δ) with the autocorrelation function of the signal model (and an amplitude scaling that does not affect the phase). Hence, if the phase around the zero crossing is anti-symmetric (e.g. linear) in an interval that is large compared to the autocorrelation function of the signal model, the result will be very similar to that obtained for a white-noise model. Another, more lax, interpretation of the reasoning above is that as long as the phase is well behaved around zero, the choice of signal model is not critical. As an example, the phases of the four rows of a table gij(δ) are plotted in figure 7.3 with and without convolution with the autocorrelation function 1 − |ξ|. (Note that two of the rows, g11 and g22, are equal, which means that only three curves for each case are visible.) This autocorrelation function is usually assumed for natural images.

Figure 7.3: The phase of the four rows of the table containing gij(δ) without convolution (solid line) and with convolution with the autocorrelation function 1 − |ξ|.

7.2.2 Multiple disparities

If more than one zero crossing is detected, the magnitudes of the correlations can be used to select a solution.
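That the model correlation equals the white-noise correlation convolved with the autocorrelation function (equations 7.13–7.14), and that the order of the convolutions does not matter, can be checked numerically. The triangle `cs` is an illustrative stand-in for a sampled autocorrelation function:

```python
import numpy as np

rng = np.random.default_rng(0)
fx = rng.standard_normal(15)
fy = rng.standard_normal(15)

# c'(delta) for white noise: correlation of fx with fy, i.e. fx convolved
# with the reversed filter
c_white = np.convolve(fx, fy[::-1])

# Illustrative sampled autocorrelation function (a triangle, like 1 - |xi|)
cs = np.array([0.25, 0.5, 1.0, 0.5, 0.25])

# Equation 7.14: convolve the white-noise correlation with cs ...
c_model = np.convolve(c_white, cs)
# ... equation 7.13: or convolve fx with cs first; the results agree
alt = np.convolve(np.convolve(fx, cs), fy[::-1])
print(np.allclose(c_model, alt))  # True
```

In the algorithm, only the table rows gij(δ) need this one extra convolution, performed once when the table is built.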
Since the CCA searches for maximum correlation, the zero crossing with maximum correlation c(δ) is most likely to be the best estimate. If two zero crossings have approximately equal magnitude (and the canonical correlation ρ is high), both disparity estimates can be considered to be correct within the neighbourhood, which indicates either a depth discontinuity or that there really exist two disparities.

Figure 7.4: A simple example of a pair of filters that have two correlation peaks.

The latter is the case for semi-transparent images, i.e. images that are sums of images at different depths. Such images are typical of many medical applications, for example X-ray images. An everyday example of this kind of image is obtained by looking through a window with a reflection. (The effect on the intensity of a light ray or X-ray when passing two objects is in fact multiplicative, but a logarithmic transfer function is usually applied when generating X-ray images, which makes the images additive.) Note that both disparity estimates are represented by the same canonical correlation solution. This means that the CCA must generate filters that have correlation peaks for two different disparities. To see how this can be done, consider the simple filter pair illustrated in figure 7.4. The cross-correlation (or convolution) between these two filters is identical to the left filter, which consists of two impulses. The example is much simplified, but it illustrates the possibility of having two filters with two correlation peaks. If the CCA were used directly on the pixel data instead of on the quadrature filter outputs, such a filter pair could develop. In the present method, the image data are represented using other basis functions (the quadrature filters of the basis filter set), but it is still possible to construct filters with two correlation peaks.
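The filter pair of figure 7.4 can be reproduced with impulses: a left filter with two impulses correlated against a right filter with one impulse gives a correlation function with two equal peaks, i.e. two candidate disparities.

```python
import numpy as np

# Left filter: two impulses; right filter: one impulse (figure 7.4)
fx = np.zeros(11); fx[3] = fx[7] = 1.0
fy = np.zeros(11); fy[5] = 1.0

# Cross-correlation as convolution with the reversed right filter
corr = np.convolve(fx, fy[::-1])
peaks = np.flatnonzero(corr == corr.max())
print(len(peaks))  # 2
```

The two maxima are separated by the spacing of the impulses in the left filter, which is exactly the separation of the two disparities being represented.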
7.2.3 Images with different scales

If the images are differently scaled, the CCA will try to create filters scaled correspondingly. In order to improve the disparity estimates in these cases, the table can be extended with scaled versions of the basis filters:

  gij(σ, δ) = fi* Css fj(σ, δ),   (7.15)

where fj(σ, δ) is a scaled and shifted version of fj. The motivation for this is that a scaled signal convolved with a certain filter gives the same result as the non-scaled signal convolved with a reciprocally scaled filter. The CCA step is not affected by this, and the phase analysis is performed as described above for each scale. The correct scale is indicated by having the maximum real-valued correlation. The resolution in scale can be very coarse. In the experiments presented in the following section, the filters have been scaled between ±1 octave in steps of a quarter of an octave, which seems to be a quite sufficient resolution. It should be noted that the disparity estimates measured in pixels will differ between the two images, since one of the images has a scaled filter as reference. But given the filter scales, the interpretations in terms of depth are of course the same in both images.

7.3 Experiments

In this section, some experiments are presented to illustrate the performance of the stereo algorithm. First, some results on artificial data are shown. Finally, the algorithm is applied to two real stereo image pairs, both common test objects for stereo algorithms. In all experiments presented here, a basis filter set consisting of two one-dimensional, horizontally oriented quadrature filters has been used, both with a centre frequency of π/4 and a bandwidth of two octaves. The filters have 15 coefficients in the spatial domain and are shifted two pixels relative to each other. The frequency function is approximately a squared cosine on a log scale:

  F(u) = cos²(k ln(u/u0)),   (7.16)

where k = π/(2 ln 2) and u0 = π/4.
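A filter of this kind can be obtained by sampling a one-sided spectrum with the frequency function of equation 7.16 and inverse-transforming it. The thesis does not give its design procedure, so the construction below (DFT length, cropping) is an illustrative assumption, not the actual recipe:

```python
import numpy as np

def quadrature_filter(n=15, u0=np.pi / 4, bandwidth_oct=2.0, N=256):
    """Sketch of a 1-D quadrature filter with the squared-cosine
    log-scale frequency function of equation 7.16, built by inverse
    DFT of a one-sided spectrum (design details are assumptions)."""
    k = np.pi / (bandwidth_oct * np.log(2))
    u = 2 * np.pi * np.fft.fftfreq(N)
    F = np.zeros(N)
    pos = u > 0                         # one-sided: quadrature property
    r = k * np.abs(np.log(u[pos] / u0))
    F[pos] = np.where(r < np.pi / 2, np.cos(r) ** 2, 0.0)
    f = np.fft.ifft(F)                  # complex spatial kernel, centred at 0
    return np.roll(f, n // 2)[:n]       # crop to n taps around the origin

f = quadrature_filter()
print(len(f), abs(f.sum()))
```

Since the spectrum vanishes at the origin, the (cropped) kernel has an approximately zero DC component, as a quadrature filter should.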
The actual filter functions are illustrated in figure 7.5. For the experiments on the artificial data, the neighbourhood for the CCA has been chosen to fit the problem reasonably well. This means that the neighbourhood is longer in the direction of constant disparity than in the direction where the disparity changes. In the real images, a square neighbourhood has been used. How the choice of neighbourhood can be made adaptive is discussed in section 7.4.

7.3.1 Discontinuities

The first experiment illustrates the algorithm's ability to handle depth discontinuities. The test image is made of white noise shifted so that the disparity varies between ±d along the horizontal axis, where d varies as a ramp from −5 pixels to +5 pixels along the vertical axis, in order to get discontinuities between ±10 pixels. A neighbourhood N of 13×7 pixels (horizontal × vertical) was used for the CCA. Figure 7.6 shows the estimated disparity for this test image. Disparity estimates with corresponding canonical correlations less than 0.7 have been removed. In figure 7.7, two lines of the disparity estimate are shown. To the left, line 20 with a disparity of ±2.5 pixels is shown and, to the right, line 38 with a disparity of 1 pixel is shown.

Figure 7.5: The filter in the basis filter set, shown in the spatial domain and the frequency domain.

Figure 7.6: Disparity estimates for different depth discontinuities.

Figure 7.7: Top: Line 20 (left) and line 38 (right) from the disparity estimates in figure 7.6. The small dots indicate the disparity estimates with the second strongest correlations. Bottom: The corresponding correlations.
The figures at the top show the most likely (large dots) and second most likely (small dots) disparity estimates along these lines. The bottom figures show the corresponding canonical correlations at the zero crossings. Figures 7.6 and 7.7 show that for small discontinuities, the algorithm interpolates the estimates, while for large discontinuities, there are two overlapping estimates. An interpolation or fusion for small disparity differences is also performed by the human visual system. The depth interval within which all points are fused into a single image is called Panum's area (see for example Coren and Ward, 1989).

7.3.2 Scaling

The second experiment shows that the algorithm can estimate disparities between images that are differently scaled. The test image here is white noise warped to form a ramp along the horizontal axis. The warping is made so that the right image is scaled to 50% of the original size, which means that there is a scale difference of one octave. For a human, this corresponds to looking at a point on a surface with its normal rotated 67° from the observer, at a distance of 20 centimetres. In this experiment, a neighbourhood N of 3×31 pixels was used. In figure 7.8, the results are shown for the basic algorithm without the scaling parameter (left) and for the extended algorithm that searches for the optimal scaling (right). The lines at the back of the graphs show the mean value. The filters created by the CCA are illustrated in figure 7.9. The left-hand plots show the filters in the spatial domain and the right-hand plots show them in the frequency domain.

Figure 7.8: Disparity estimates for a scale difference of one octave between the images, without scale analysis (left) and with scale analysis (right).

7.3.3 Semi-transparent images

This experiment illustrates the algorithm's capability of multiple disparity estimates on semi-transparent images.
The test images in this experiment were generated as a sum of two images with white uncorrelated noise. The images were tilted in opposite directions around the horizontal axis. The disparity range was ±5 pixels. Figure 7.10 illustrates the test scene. The stereo pair is shown in figure 7.11. Here, the averaging or fusion performed by the human visual system for small disparities can be seen in the middle of the image. A neighbourhood N of 31×3 pixels was used for the CCA. The result is shown in figure 7.12. In figure 7.13, the estimates are projected along the horizontal axis. The results show that the disparities of both planes are approximately estimated. In the middle, where the disparity difference is small, the result is an average between the two disparities, in accordance with the results illustrated in figure 7.7.

Figure 7.9: The filters created by CCA, in the spatial domain (left) and the frequency domain (right). Solid lines show the real parts and dashed lines show the imaginary parts.

Figure 7.10: The test image scene for semi-transparent images.

Figure 7.11: The stereo image pair for the semi-transparent images.

7.3.4 An artificial scene

This experiment tries to simulate a slightly more realistic case where both the discontinuity problem and the scale problem are present. The scene can be thought of as a pole or a tree in front of a wall. Figure 7.14 illustrates the scene from above. The distance from the wall to the centre of the tree was 2, the radius of the tree was 1, the distance from the wall to the cameras was 5 and the distance between the cameras was 0.4 length units.
A texture of white noise was applied to the wall and the tree, and a stereo pair of images was generated. Each image had a size of 200×31 pixels. The generated stereo images are shown in figure 7.15. The disparity was calculated only for one line. Also in this case, a neighbourhood N of 3×31 pixels was used for the CCA. The algorithm was run 100 times on different noise images. The result is illustrated in figure 7.16. Close to the edges of the tree, the images are differently scaled. In figure 7.17, the average scale difference used by the algorithm is plotted. The scaling can be done in nine steps between ±1 octave and, in the figure, the average scaling (in octaves) is plotted. The plot illustrates how the algorithm scales the images relative to each other in one way near the left edge of the tree and in the opposite way at the other edge, as expected. There is no scale difference on the background or in the middle of the tree.

7.3.5 Real images

The two final experiments illustrate how the algorithm works on real stereo image pairs. In both experiments, a neighbourhood N of 7×7 pixels was used.

Figure 7.12: The result for the semi-transparent images. The disparity estimates are coloured to simplify the visualization.

The first stereo pair is two air photographs of the Pentagon (see figure 7.18, upper row). The result is shown in the bottom row of the same figure. To the left, the disparity estimates are shown; white means high disparity and black means low disparity. The lower-right image shows a certainty image calculated from the canonical correlation in each neighbourhood. The certainty used here is the logarithm of the SNR according to equation 4.38 on page 71, plus an offset in order to make it positive. The second stereo pair is two images from a well-known image sequence, the Sarnoff tree sequence.
This stereo pair is shown at the top of figure 7.19 and the result and the certainty image are shown at the bottom of the same figure. The results are also illustrated in colour in figures 7.20 and 7.21. The images at the top are generated so that the colour represents disparity and the intensity represents the original (left) image. The images at the bottom are 3-dimensional surface plots with height and colour representing the disparity estimates. Note that the walls in the Pentagon result are depth discontinuities and not just steep slopes.

Figure 7.13: A projection along the horizontal axis of the estimates in figure 7.12.

Figure 7.14: The artificial tree scene from above.

Figure 7.15: The stereo pair generated from the artificial tree scene.

Figure 7.16: The result for the artificial test scene. The top graph shows the average disparity estimate; the dotted lines show the standard deviation. The middle graph shows the true disparity. The bottom graph shows the mean disparity error and the standard deviation of the disparity error (dotted line).

Figure 7.17: The average scaling performed by the algorithm. 0 means no scaling; +1/4 means that the left image is scaled +1/4 of an octave, i.e. made smaller compared to the right image.

7.4 Discussion

The stereo algorithm described in this chapter is rather different from most image processing algorithms. A common procedure is first to optimize a set of filters and then to use these filters to analyse the image, or to perform statistical analysis directly on the pixel data.
This algorithm, however, first adapts the filters to a local region in the image and then analyses the adapted filters. In all the experiments presented in the preceding section, a filter set of two filters differing only by a shift was used. A larger filter set with filters at different scales would be able to handle larger disparity ranges. If such a set of filters were used, the algorithm would simply select the proper scale to work on, i.e. the scale that has the highest correlation. In general, a larger filter set offers the shapes of the adapted filters more freedom. Hence, a larger filter set should make it easier to handle multiple disparities, depth discontinuities and scale differences, if the filter set is chosen properly. A larger filter set covering a wider range of frequencies would also reduce the risk of the signal in a region giving very weak filter output because the filter does not fit the signal. With a larger filter set, the CCA would only use the filters that have a high SNR for the current signal.

Figure 7.18: Upper row: Stereo pair of the Pentagon. Lower left: Resulting disparity estimates. Lower right: Certainty image of the estimates.

Figure 7.19: Upper row: Stereo pair of the tree scene. Lower left: Resulting disparity estimates. Lower right: Certainty image of the estimates.

The filter set can be seen as the basis functions used for representing the signal. The simplest choice of basis functions is the pixels themselves. The canonical correlation vectors will then define the filters directly in the pixel basis. A disadvantage of such an approach is that the analysis of the filters becomes expensive. The canonical correlation vectors in the experiments presented here were two-dimensional, since there were two basis filters. If the pixel basis is used, the dimensionality is equal to the size of the filters that the algorithm is to construct.
This means, for example, that if the algorithm should be able to use 1×15 filter kernels, the canonical correlation vectors become 15-dimensional. In other words, the pixel basis is not a good choice of signal representation in this problem (see the discussion in chapter 3). Since we know a better representation for this problem (i.e. quadrature filters), it would be unwise not to use it.

Figure 7.20: Result for the Pentagon images in colour. The upper image displays the disparity estimate as colour overlaid on the original intensity image.

Figure 7.21: Result for the tree images in colour. The upper image displays the disparity estimate as colour overlaid on the original intensity image.

The choice of neighbourhood for the CCA is of course important for the result. If there is a priori knowledge of the shape of the regions that have relatively constant depth, the neighbourhood should, of course, be chosen accordingly. This means that if the disparity is known to be relatively constant along the vertical axis, for example, the shape of the neighbourhood should be elongated vertically, as in the experiments on artificial data in the previous section. It is, however, possible to let the algorithm select a suitable neighbourhood shape automatically. This may be done in two ways. One way is to measure the canonical correlation for a few different neighbourhood shapes. These shapes could be, for example, one horizontally elongated, one vertically elongated and one square. The algorithm should then use the result from the neighbourhood that gave the highest canonical correlation to estimate the disparity. Another way to automatically select the neighbourhood shape is to begin with relatively small square-shaped neighbourhoods to get a coarse disparity estimate. Then the disparity estimates are segmented.
A second run of the algorithm can then use neighbourhood shapes selected according to the shapes of the segmented regions. It should be noted that the neighbourhoods can be arbitrarily shaped and even non-connected. The only advantage of a rectangular neighbourhood is that it makes the calculation of the covariance matrices for the CCA computationally efficient. But if this is exploited in the first run and the covariance matrices are stored, they can simply be added when the new, larger neighbourhoods are formed in the second run. On the tree image, for example, this approach would give vertically elongated neighbourhoods on the tree and horizontally elongated neighbourhoods on the ground.

Chapter 8

Epilogue

In this final chapter, the thesis is summed up and discussed. To conclude, some ideas for future research are presented.

8.1 Summary and discussion

The thesis started with a discussion of learning systems. Three different principles of learning were described. Supervised learning can be seen as function approximation. The need for a training set with an associated set of desired outputs restricts its use to tasks where such training data can be obtained. Reinforcement learning, on the other hand, is more general than supervised learning, and we believe that it is an important general learning principle in complex systems. Its relation to learning among animals and to evolution supports this position. Unsupervised learning is a way of finding a data-dependent representation of the signals that is useful according to some criterion. We do not believe that unsupervised learning is the highest general learning principle, since the performance measure these methods try to maximize is related only to the internal data representation and has nothing to do with the actual performance of the system in terms of actions.
Unsupervised learning can, however, be an important component which helps the system find a good signal representation. It should again be pointed out that the difference between the three learning principles is not as clear as it might seem at first, as discussed in section 2.6.

For unsupervised learning, we believe that methods based on maximizing information are important. If nothing else is known about the optimal choice of representation, it is probably wise to preserve as much information (rather than, for example, variance) as possible. It is, however, not only the amount of information that is important: the information to be represented must also be relevant for the task. In other words, it must be related to information about possibly successful responses; otherwise it is not useful. This makes methods based on maximum mutual information good candidates.

The signal representation needs a model for the represented information. A complex global model is not a realistic choice for large systems with high-dimensional input and output signals. The number of parameters to estimate would be far too large, and the structural credit assignment problem would be unsolvable (see section 2.7.2). We believe that local low-dimensional linear models should be used. One reason for this is that only a small fraction of a high-dimensional signal space will ever be visited by the signal. Furthermore, this signal is (at least) piecewise continuous because of the dynamics of the real world, which means that it can be represented arbitrarily well with local linear models. How to distribute these models is only briefly mentioned in this thesis (section 3.5); the interested reader is referred to the PhD thesis by Landelius (1997) for a detailed investigation of this subject. The choice of local linear models can be made according to different criteria depending on the task.
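The piecewise-linear argument can be illustrated with a small numerical sketch. Here a fixed uniform partition of a 1-D signal is used; the thesis instead considers adaptively distributed models (see Landelius, 1997):

```python
import numpy as np

# A smooth 1-D signal, approximated by local linear models on a partition.
x = np.linspace(0, 2 * np.pi, 400)
y = np.sin(x)

def piecewise_linear_error(n_pieces):
    """Fit one least-squares line per interval; return the max absolute error."""
    err = 0.0
    for idx in np.array_split(np.arange(x.size), n_pieces):
        a, b = np.polyfit(x[idx], y[idx], 1)   # local linear model
        err = max(err, np.max(np.abs(a * x[idx] + b - y[idx])))
    return err

coarse = piecewise_linear_error(2)
fine = piecewise_linear_error(16)
print(coarse, fine)   # the error shrinks as the local models get smaller
```

As the partition is refined, the approximation error of the local linear models decreases, which is the sense in which a piecewise continuous signal can be represented arbitrarily well.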
If maximum mutual information is the criterion, canonical correlation analysis is a proper method for finding local linear models. CCA is related to PCA, PLS and MLR, which maximize other statistical or mean-square-error criteria (see chapter 4). An iterative algorithm for these four methods was presented. The algorithm is in fact more general: it finds the solutions to the generalized eigenproblem. An important feature of the proposed algorithm is that it finds the solutions successively, beginning with the most significant one. This enables low-rank versions of the solutions of the four methods, which is necessary if the signal dimensionality is high. Another nice feature is that the algorithm gives the eigenvectors and the corresponding eigenvalues, and not only normalized eigenvectors as is the case with many other iterative methods.

It was shown that CCA can be used for learning feature descriptors for computer vision. The proposed method allows the user to define what is equal in two signals by giving the system examples. If other features are varied in an uncorrelated way, the feature descriptors become invariant to these features. An experiment showed that the system learned quadrature filters when it was trained to represent orientation invariant to phase. When quadrature filter outputs are used as input to the system, it learns to combine them in a way that is less sensitive to noise than vector averaging, without losing spatial resolution. For a 5 × 5 neighbourhood, the angular error of the orientation estimate was reduced by 4 dB, which is quite a substantial improvement. This method will most likely replace vector averaging in many applications where there is a conflict between the need for noise reduction and the need for spatial resolution.

Another application of CCA in computer vision is stereo. A novel stereo algorithm was presented in chapter 7.
The algorithm is a bit unusual in that it first adapts filters to an image neighbourhood and then analyses the resulting filters. A more common approach in computer vision is first to optimize filters and then to use these filters to analyse the image. Some interesting features of the proposed algorithm are that it can handle depth discontinuities, multiple depths in semi-transparent images and image pairs that are differently scaled. Although only one basis filter set, with two shifted identical filters, has been tested, the results look very promising both on real and artificial images. We believe that the proposed method can be useful also in motion estimation, in particular on x-ray images, where there are multiple motions in semi-transparent images.

8.2 Future research

There are a number of ideas left open for future research. One interesting question is how to combine reinforcement learning and mutual information based unsupervised learning.

A rather ad hoc modification of the canonical correlation algorithm that can handle very high-dimensional signals was presented. Other methods for handling adaptive update factors should be investigated. Preliminary investigations indicate that the RPROP algorithm (Riedmiller and Braun, 1993) can be modified to fit our algorithm. Since the purpose of a gradient-based algorithm is to handle very high-dimensional signals, it is important that the algorithm is optimized for such cases.

The theory for the gradient search method presented in chapter 4 was developed for real-valued signals. In chapters 6 and 7, however, we have seen that canonical correlation is useful also when analysing complex-valued signals. Hence, an extension of the theory in chapter 4 to include complex-valued signals is desirable.

One of the most interesting issues for future research based on this work is to investigate how canonical correlation can be used in multidimensional signal processing.
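For reference, the quantities involved can be computed in batch form from sample covariance matrices; a minimal sketch on synthetic data (all variable names are ours). The iterative algorithm of chapter 4 targets the same solutions but finds them successively, which is what permits low-rank versions in high dimensions:

```python
import numpy as np

# Batch canonical correlations of two zero-mean signal sets, via the
# eigenvalues of Cxx^-1 Cxy Cyy^-1 Cyx (whose eigenvalues are the squared
# canonical correlations).
rng = np.random.default_rng(1)
s = rng.standard_normal(2000)                    # common underlying source
x = np.c_[s + 0.1 * rng.standard_normal(2000),   # first components correlate
          rng.standard_normal(2000)]             # second components do not
y = np.c_[s + 0.1 * rng.standard_normal(2000),
          rng.standard_normal(2000)]

x -= x.mean(0); y -= y.mean(0)
Cxx = x.T @ x / len(x); Cyy = y.T @ y / len(y); Cxy = x.T @ y / len(x)
M = np.linalg.solve(Cxx, Cxy) @ np.linalg.solve(Cyy, Cxy.T)
rho = np.sqrt(np.sort(np.linalg.eigvals(M).real)[::-1].clip(0))
print(rho)   # the first canonical correlation is high, the second low
```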
The experiments in chapter 6 show that phase invariant orientation filters can be learned by using this method. The use of this algorithm for detecting other, higher-level, features should be investigated. Examples of such features are line crossings, corners and even texture.

Consider a pair of local neighbourhoods with a given spatial relation, as illustrated in figure 8.1. The spatial relation is defined by a displacement vector r. If data are collected from such neighbourhood pairs in a larger region of the image, a CCA would give the linear combination of one neighbourhood that is the most predictable and, at the same time, the linear combination of the other neighbourhood that is the best predictor. For each displacement, this would give a measure of the best linear relation between the image patches and a description of that relation. This can be performed directly on the pixel data or on a filtered image.

Figure 8.1: Illustration of how CCA can be used for generating a texture descriptor by analysing the linear relation between two neighbourhoods x and y with a spatial relationship defined by the displacement vector r.

Consider, for example, a sine wave pattern without noise. The canonical correlation for such an image would be one for all displacements between the neighbourhoods. This is logical, since the pattern is totally predictable. An ordinary correlation analysis, however, would give zero correlation where the phases of the patterns differ by 90°. A matrix containing the largest canonical correlations for different neighbourhood displacements defines the displacement vectors for which the patterns are linearly predictable. Instead of the matrix, a tensor containing the canonical correlation vectors can be used. Such a tensor would be a descriptor of the texture. The use of such descriptors in texture analysis should be investigated.
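The sine-wave example can be checked numerically. Below is a 1-D sketch with two-pixel neighbourhoods; the signal frequency and neighbourhood size are our choices for illustration:

```python
import numpy as np

# For a pure sine pattern, the canonical correlation between two displaced
# neighbourhoods is ~1 for any displacement, while the ordinary (scalar)
# correlation vanishes at a 90-degree phase difference.
t = np.arange(4000)
s = np.sin(0.1 * t)
d = int(round(np.pi / 2 / 0.1))              # displacement ~ 90 deg of phase

X = np.c_[s[:-d-1], s[1:-d]]                 # neighbourhood x = (s(t), s(t+1))
Y = np.c_[s[d:-1], s[d+1:]]                  # neighbourhood y, displaced by d
X = X - X.mean(0); Y = Y - Y.mean(0)

scalar_corr = np.corrcoef(X[:, 0], Y[:, 0])[0, 1]      # near zero at 90 deg
Cxx, Cyy, Cxy = X.T @ X, Y.T @ Y, X.T @ Y
M = np.linalg.solve(Cxx, Cxy) @ np.linalg.solve(Cyy, Cxy.T)
rho_max = np.sqrt(np.max(np.linalg.eigvals(M).real))   # near one: predictable
print(scalar_corr, rho_max)
```

Sweeping the displacement d and storing the largest canonical correlation for each value gives the kind of predictability map described above.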
The generalization of the canonical correlation method to finding maximum mutual information, as illustrated in figure 6.2 on page 109, should be investigated. The non-linear functions f_x and f_y can be implemented as neural networks with, for example, sigmoid or radial-basis functions. The neural networks are then trained, for example using back-propagation, to maximize the canonical correlation ρ. The method in chapter 6 is, of course, not limited to image data. Another interesting application is speech recognition, where it is important to be invariant with respect to how the words are pronounced.

Another very interesting issue is the extension of the stereo algorithm in chapter 7 to estimate both vertical and horizontal shifts, i.e. two-dimensional translations of the image. If the neighbourhoods are taken from different frames in a temporal image sequence, the extended algorithm could be used for motion estimation. The capability of handling multiple estimates in semi-transparent images would make this method interesting in medical applications. The problem of estimating multiple motions exists, for example, in x-ray image sequences, where different parts of the body move in different ways. The capability of handling scaling between the images would make it possible to handle motions more complex than pure translations, for example three-dimensional rotations and deformations.

Appendix A

Definitions

In this appendix, some useful non-standard functions are defined. "≜" means "equal by definition".

A.1 The vec function

Consider an m × n matrix M:

M = [m_1 m_2 … m_n]   (A.1)

where the columns m_i are m-dimensional vectors. Then

v = vec(M) ≜ (m_1^T m_2^T … m_n^T)^T,   (A.2)

i.e. the columns of M stacked into an mn-dimensional vector.

A.2 The mtx function

Consider an mn-dimensional vector v. Then

M = mtx(v, m, n) ≜ [m_1 m_2 … m_n],   (A.3)

where the columns m_i are m-dimensional vectors.

A.3 Correlation for complex variables

Consider x, y ∈ ℂ with means x̄ and ȳ respectively. The correlation between x and y is defined as

Corr(x, y) = E[(x − x̄)(y − ȳ)*] / √(E[|x − x̄|²] E[|y − ȳ|²]).   (A.4)

Appendix B

Proofs

This appendix contains all the proofs referred to in the text.

B.1 Proofs for chapter 2

B.1.1 The differential entropy of a multidimensional Gaussian variable

h(z) = (1/2) log((2πe)^N |C|),   (2.41)

where |C| is the determinant of the covariance matrix of z and N is the dimensionality of z.

Proof: The Gaussian distribution for an N-dimensional variable z is

p(z) = (2π)^{-N/2} |C|^{-1/2} exp(−(1/2) z^T C^{-1} z).   (B.1)

The definition of differential entropy (equation 2.38 on page 29) then gives

h(z) = −∫ p(z) log p(z) dz = ∫ p(z) [(1/2) log((2π)^N |C|) + (1/2) z^T C^{-1} z] dz = (1/2) log((2π)^N |C|) + N/2 = (1/2) log((2πe)^N |C|).   (B.2)

Here, we have used the fact that

∫ p(z) z^T C^{-1} z dz = E[z^T C^{-1} z] = E[tr(z z^T C^{-1})] = tr(C C^{-1}) = N.   (B.3)

B.2 Proofs for chapter 3

B.2.1 The constant norm of the channel set

∑_k |c_k|² = constant, where

c_k = cos²(π(x − k)/3) if |x − k| < 3/2, and c_k = 0 otherwise   (3.2)

(page 41).

Proof: Consider the interval −π/6 < x ≤ π/6. On this interval, all channels are zero except those for k = −1, 0, 1. Hence, it is sufficient to sum over these three channels:

∑ |c_k|² = cos⁴(π(x−1)/3) + cos⁴(πx/3) + cos⁴(π(x+1)/3).

Using cos⁴θ = 3/8 + (1/2) cos 2θ + (1/8) cos 4θ and writing θ_k = π(x−k)/3, the three angles 2θ_k are 2π/3 apart and the three angles 4θ_k are 4π/3 ≡ −2π/3 apart (mod 2π). Since three equal-amplitude cosines with phases 2π/3 apart sum to zero, both the cos 2θ terms and the cos 4θ terms cancel, leaving

∑ |c_k|² = 3 · 3/8 = 9/8.

This generalizes to any x that is covered by three channels of this shape whose arguments are separated by π/3.

B.2.2 The constant norm of the channel derivatives

∑_k |dc_k/dx|² = constant (page 41).

Proof: The derivative of channel k with respect to x is

dc_k/dx = −(2π/3) cos(π(x−k)/3) sin(π(x−k)/3) = −(π/3) sin(2π(x−k)/3).

Squaring and summing over the three non-zero channels gives

∑ |dc_k/dx|² = (π/3)² [sin²(2πx/3) + sin²(2π(x−1)/3) + sin²(2π(x+1)/3)].

The three arguments are again 2π/3 apart, and since sin²φ = (1 − cos 2φ)/2 the cosine terms cancel as above, so the sum equals (π/3)² · 3/2 = π²/6.

B.2.3 Derivation of the update rule for the prediction matrix memory

r = p + a ||q||² ||v||²   (3.18)

Proof: By inserting equation 3.17 on page 49 into equation 3.14 on page 49, we get

r = ⟨W + a q v^T | q v^T⟩ = ⟨W | q v^T⟩ + a ⟨q v^T | q v^T⟩ = p + a (q^T q)(v^T v) = p + a ||q||² ||v||².

B.2.4 One frequency spans a 2-D plane

One frequency component defines an ellipse and, hence, spans a two-dimensional plane (page 51).

Proof: Consider a signal with frequency ω in an n-dimensional space:

(a_1 sin(ωt + α_1), …, a_n sin(ωt + α_n))^T = (a_1 cos α_1, …, a_n cos α_n)^T sin(ωt) + (a_1 sin α_1, …, a_n sin α_n)^T cos(ωt) = v_1 sin(ωt) + v_2 cos(ωt).   (B.4)

Remark: It should be noted that the two-dimensionality is caused by the different phases α_i. If all components have the same phase, the signal spans only one dimension.

B.3 Proofs for chapter 4

B.3.1 Orthogonality in the metrics A and B

ŵ_i^T B ŵ_j = 0 for i ≠ j and β_i > 0 for i = j; ŵ_i^T A ŵ_j = 0 for i ≠ j and r_i β_i for i = j.   (4.6)

Proof: For solution i we have

A ŵ_i = r_i B ŵ_i.   (B.5)

The scalar product with another eigenvector gives

ŵ_j^T A ŵ_i = r_i ŵ_j^T B ŵ_i   (B.6)

and of course also

ŵ_i^T A ŵ_j = r_j ŵ_i^T B ŵ_j.   (B.7)

Since A and B are Hermitian, we can change positions of ŵ_i and ŵ_j, which gives

r_j ŵ_i^T B ŵ_j = r_i ŵ_i^T B ŵ_j   (B.8)

and hence

(r_i − r_j) ŵ_i^T B ŵ_j = 0.   (B.9)

For this expression to be true when i ≠ j, we must have ŵ_i^T B ŵ_j = 0 if r_i ≠ r_j. For i = j we have ŵ_i^T B ŵ_i = β_i > 0, since B is positive definite. In the same way we have

(1/r_i − 1/r_j) ŵ_i^T A ŵ_j = 0,   (B.10)

which means that ŵ_i^T A ŵ_j = 0 for i ≠ j. For i = j we know that ŵ_i^T A ŵ_i = r_i ŵ_i^T B ŵ_i = r_i β_i.

B.3.2 Linear independence

The {ŵ_i} are linearly independent.

Proof: Suppose the {ŵ_i} are not linearly independent. Then we could write an eigenvector ŵ_k as

ŵ_k = ∑_{j≠k} γ_j ŵ_j.   (B.11)

This means that for some j ≠ k with γ_j ≠ 0,

ŵ_j^T B ŵ_k = γ_j ŵ_j^T B ŵ_j ≠ 0,   (B.12)

which violates equation 4.6 on page 63. Hence, the {ŵ_i} are linearly independent.

B.3.3 The range of r

r_n ≤ r ≤ r_1   (4.7)

Proof: If we express a vector w in the basis of the eigenvectors ŵ_i, i.e.

w = ∑_i γ_i ŵ_i,   (B.13)

we can write

r = (∑ γ_i ŵ_i)^T A (∑ γ_i ŵ_i) / (∑ γ_i ŵ_i)^T B (∑ γ_i ŵ_i) = ∑ γ_i² α_i / ∑ γ_i² β_i,   (B.14)

where α_i = ŵ_i^T A ŵ_i and β_i = ŵ_i^T B ŵ_i, since ŵ_i^T A ŵ_j = ŵ_i^T B ŵ_j = 0 for i ≠ j. Now, since α_i = β_i r_i (see equation 4.6 on page 63), we get

r = ∑ γ_i² β_i r_i / ∑ γ_i² β_i.   (B.15)

This is a weighted mean of the r_i with non-negative weights. It attains the maximum value r_1 when γ_1 ≠ 0 and γ_i = 0 ∀ i > 1, where r_1 is the largest eigenvalue. The minimum value, r_n, is obtained when γ_n ≠ 0 and γ_i = 0 ∀ i < n, where r_n is the smallest eigenvalue.

B.3.4 The second derivative of r

H_i = ∂²r/∂w² |_{w=ŵ_i} = (2 / ŵ_i^T B ŵ_i)(A − r_i B)   (4.8)

Proof: From the gradient in equation 4.3 on page 61 we get the second derivative

∂²r/∂w² = (2 / (w^T B w)²) [(A − (∂r/∂w) w^T B − r B) w^T B w − (A w − r B w) 2 w^T B].   (B.16)

If we insert one of the solutions ŵ_i, we have

∂r/∂w |_{w=ŵ_i} = (2 / ŵ_i^T B ŵ_i)(A ŵ_i − r_i B ŵ_i) = 0   (B.17)

and hence

∂²r/∂w² |_{w=ŵ_i} = (2 / ŵ_i^T B ŵ_i)(A − r_i B).   (B.18)

B.3.5 Positive eigenvalues of the Hessian

There exists a w such that w^T H_i w > 0 ∀ i > 1.   (4.9)

Proof: If we express w as a linear combination of the eigenvectors, w = ∑ γ_j ŵ_j, we get

(β_i / 2) w^T H_i w = w^T (A − r_i B) w = w^T B (B^{-1} A − r_i I) w = (∑ γ_j ŵ_j)^T B (∑ (r_j − r_i) γ_j ŵ_j) = ∑ γ_j² β_j (r_j − r_i),   (B.19)

where β_j = ŵ_j^T B ŵ_j > 0. Now, (r_j − r_i) > 0 for j < i, so if i > 1 there is at least one choice of w that makes this sum positive.

B.3.6 The partial derivatives of the covariance

∂ρ/∂w_x = (1/||w_x||)(C_xy ŵ_y − ρ ŵ_x) and ∂ρ/∂w_y = (1/||w_y||)(C_yx ŵ_x − ρ ŵ_y).   (4.17)

Proof: With ρ = w_x^T C_xy w_y / (||w_x|| ||w_y||), the partial derivative of ρ with respect to w_x is

∂ρ/∂w_x = [C_xy w_y ||w_x|| ||w_y|| − w_x^T C_xy w_y ||w_x||^{-1} w_x ||w_y||] / (||w_x||² ||w_y||²) = C_xy ŵ_y / ||w_x|| − ρ ŵ_x / ||w_x|| = (1/||w_x||)(C_xy ŵ_y − ρ ŵ_x).

The same calculation applies to ∂ρ/∂w_y, exchanging x and y.

B.3.7 The partial derivatives of the correlation

∂ρ/∂w_x = (a/||w_x||)(C_xy ŵ_y − (ŵ_x^T C_xy ŵ_y / ŵ_x^T C_xx ŵ_x) C_xx ŵ_x),
∂ρ/∂w_y = (a/||w_y||)(C_yx ŵ_x − (ŵ_y^T C_yx ŵ_x / ŵ_y^T C_yy ŵ_y) C_yy ŵ_y),   (4.25)

where a = (ŵ_x^T C_xx ŵ_x · ŵ_y^T C_yy ŵ_y)^{-1/2} > 0.

Proof: With ρ = w_x^T C_xy w_y (w_x^T C_xx w_x · w_y^T C_yy w_y)^{-1/2}, the partial derivative of ρ with respect to w_x is

∂ρ/∂w_x = [C_xy w_y (w_x^T C_xx w_x w_y^T C_yy w_y)^{1/2} − w_x^T C_xy w_y (w_x^T C_xx w_x w_y^T C_yy w_y)^{-1/2} C_xx w_x w_y^T C_yy w_y] / (w_x^T C_xx w_x w_y^T C_yy w_y)
= (w_x^T C_xx w_x w_y^T C_yy w_y)^{-1/2} [C_xy w_y − (w_x^T C_xy w_y / w_x^T C_xx w_x) C_xx w_x]
= (a/||w_x||)(C_xy ŵ_y − (ŵ_x^T C_xy ŵ_y / ŵ_x^T C_xx ŵ_x) C_xx ŵ_x).

The same calculation applies to ∂ρ/∂w_y, exchanging x and y.

B.3.8 Invariance with respect to linear transformations

Canonical correlations are invariant with respect to linear transformations.

Proof: Let

x = A_x x′ and y = A_y y′,   (B.20)

where A_x and A_y are non-singular matrices. If we denote

C′_xx = E[x′ x′^T],   (B.21)

the covariance matrix for x can be written as

C_xx = E[x x^T] = E[A_x x′ x′^T A_x^T] = A_x C′_xx A_x^T.   (B.22)

In the same way we have

C_xy = A_x C′_xy A_y^T and C_yy = A_y C′_yy A_y^T.   (B.23)

Now, the equation system 4.26 on page 68 can be written as

A_x C′_xy A_y^T ŵ_y = ρ λ_x A_x C′_xx A_x^T ŵ_x and A_y C′_yx A_x^T ŵ_x = ρ λ_y A_y C′_yy A_y^T ŵ_y   (B.24)

or, multiplying by A_x^{-1} and A_y^{-1} respectively,

C′_xy ŵ′_y = ρ λ_x C′_xx ŵ′_x and C′_yx ŵ′_x = ρ λ_y C′_yy ŵ′_y,   (B.25)

where ŵ′_x = A_x^T ŵ_x and ŵ′_y = A_y^T ŵ_y. Obviously this transformation leaves the roots ρ unchanged. If we look at the canonical variates,

ŵ′_x^T x′ = ŵ_x^T A_x A_x^{-1} x = ŵ_x^T x and ŵ′_y^T y′ = ŵ_y^T A_y A_y^{-1} y = ŵ_y^T y,   (B.26)

we see that these too are unaffected by the linear transformation.

B.3.9 Relationship between mutual information and canonical correlation

I(x; y) = (1/2) log(1 / ∏_i (1 − ρ_i²)),   (4.32)

where x and y are N-dimensional Gaussian variables and ρ_i are the canonical correlations.

Proof: The differential entropy of a multidimensional Gaussian variable is

h(z) = (1/2) log((2πe)^N |C|),   (B.27)

where |C| is the determinant of the covariance matrix of z and N is the dimensionality of z (see proof B.1.1). If z = (x^T y^T)^T, the covariance matrix C can be written as

C = [C_xx  C_xy; C_yx  C_yy].   (B.28)

By using the relation

|C| = |C_xx| |C_yy − C_yx C_xx^{-1} C_xy|   (B.29)

(Kailath, 1980, page 650) and equation 2.42 on page 30, we get

I(x; y) = (1/2) log(|C_xx| |C_yy| / |C|) = −(1/2) log(|C_yy − C_yx C_xx^{-1} C_xy| / |C_yy|) = −(1/2) log |I − C_yy^{-1} C_yx C_xx^{-1} C_xy|,   (B.30)

assuming that the covariance matrices C_xx and C_yy are non-singular. The eigenvalues of C_yy^{-1} C_yx C_xx^{-1} C_xy are the squared canonical correlations (see equation 4.28 on page 68). Hence, an eigenvalue decomposition gives

I(x; y) = −(1/2) log |I − diag(ρ_1², …, ρ_N²)| = −(1/2) log ∏_i (1 − ρ_i²) = (1/2) log(1 / ∏_i (1 − ρ_i²)),   (B.31)

since the eigenvalue decomposition does not change the identity matrix.

B.3.10 The partial derivatives of the MLR-quotient

∂ρ/∂w_x = (a/||w_x||)(C_xy ŵ_y − β C_xx ŵ_x) and ∂ρ/∂w_y = (a/||w_y||)(C_yx ŵ_x − (ρ²/β) ŵ_y),   (4.44)

where β = ŵ_x^T C_xy ŵ_y / ŵ_x^T C_xx ŵ_x and a = (ŵ_x^T C_xx ŵ_x)^{-1/2} > 0.

Proof: Here ρ = w_x^T C_xy w_y (w_x^T C_xx w_x · w_y^T w_y)^{-1/2}. Proceeding as in proof B.3.7 with C_yy replaced by the identity matrix,

∂ρ/∂w_x = (w_x^T C_xx w_x w_y^T w_y)^{-1/2} [C_xy w_y − (w_x^T C_xy w_y / w_x^T C_xx w_x) C_xx w_x] = (a/||w_x||)(C_xy ŵ_y − β C_xx ŵ_x)

and

∂ρ/∂w_y = (w_x^T C_xx w_x w_y^T w_y)^{-1/2} [C_yx w_x − (w_x^T C_xy w_y / w_y^T w_y) w_y] = (a/||w_y||)(C_yx ŵ_x − (ρ²/β) ŵ_y),

since ŵ_x^T C_xy ŵ_y = ρ²/β.

B.3.11 The successive eigenvalues

H = G − λ_1 ê_1 f_1^T   (4.59)

Proof: Consider a vector u expressed as the sum of one vector parallel to the eigenvector ê_1 and one vector u_o that is a linear combination of the other eigenvectors and, hence, orthogonal to the dual vector f_1:

u = a ê_1 + u_o,   (B.32)

where f_1^T ê_1 = 1 and f_1^T u_o = 0. Multiplying H by u gives

H u = (G − λ_1 ê_1 f_1^T)(a ê_1 + u_o) = a (G ê_1 − λ_1 ê_1) + (G u_o − 0) = G u_o.   (B.33)

This shows that G and H have the same eigenvectors and eigenvalues except for the largest eigenvalue of G and its eigenvector, which H maps to zero. The eigenvector corresponding to the largest eigenvalue of H is therefore ê_2.

B.4 Proofs for chapter 7

B.4.1 Real-valued canonical correlations

The canonical correlations ρ_i are real-valued.

Proof: The squared canonical correlations are the eigenvalues of the matrix C_xx^{-1} C_xy C_yy^{-1} C_yx:

C_xx^{-1} C_xy C_yy^{-1} C_yx w_x = ρ_i² w_x.   (B.34)

A = C_xx^{-1} is Hermitian and positive definite, and B = C_xy C_yy^{-1} C_yx is Hermitian and positive semidefinite. Then AB = C_xx^{-1} C_xy C_yy^{-1} C_yx has real-valued (and non-negative) eigenvalues (see proof B.4.2) and, hence, the ρ_i are real-valued.

It should be noted that if C_xx is only positive semidefinite, A and B can be projected onto the subspace spanned by the eigenvectors corresponding to the non-zero eigenvalues. This gives two new matrices A′ and B′ with the same non-zero eigenvalues as A and B, but with A′ positive definite. In this way it can be shown that all non-zero correlations are real-valued.

B.4.2 Hermitian matrices

If A is Hermitian and positive definite and B is Hermitian, then AB has real-valued eigenvalues.

Proof: By writing the singular value decomposition A = U D U*, we see that

C = U D^{1/2} U* = A^{1/2}   (B.35)

is also Hermitian and positive definite. Then

(C B C)* = C* B* C* = C B C   (B.36)

is Hermitian and therefore has real eigenvalues. But C B C and A B have the same eigenvalues, since

A B = C² B = C (C B C) C^{-1}   (B.37)

is only a change of basis, which does not change the eigenvalues.

Bibliography

Anderson, J. A. (1972).
A simple neural network generating an interactive memory. Mathematical Biosciences, 14:197–220.

Anderson, J. A. (1983). Cognitive and psychological computation with neural models. IEEE Transactions on Systems, Man, and Cybernetics, 14:799–815.

Anderson, T. W. (1984). An Introduction to Multivariate Statistical Analysis. John Wiley & Sons, second edition.

Baird, L. C. (1995). Residual algorithms: Reinforcement learning with function approximation. In Machine Learning: Proceedings of the Twelfth International Conference, San Francisco, CA. Armand Prieditis and Stuart Russell, eds.

Baker, W. L. and Farrell, J. A. (1992). Handbook of Intelligent Control, chapter An introduction to connectionist learning control systems, pages 35–63. Van Nostrand Reinhold, New York.

Ballard, D. H. (1987). Vision, Brain, and Cooperative Computation, chapter Cortical Connections and Parallel Processing: Structure and Function. MIT Press. M. A. Arbib and A. R. Hanson, eds.

Ballard, D. H. (1990). Computational Neuroscience, chapter Modular Learning in Hierarchical Neural Networks. MIT Press. E. L. Schwartz, ed.

Barlow, H. (1989). Unsupervised learning. Neural Computation, 1:295–311.

Barlow, H. B., Kaushal, T. P., and Mitchison, G. J. (1989). Finding minimum entropy codes. Neural Computation, 1:412–423.

Barnard, S. T. and Fischler, M. A. (1982). Computational stereo. ACM Comput. Surv., 14:553–572.

Barto, A. G. (1992). Handbook of Intelligent Control, chapter Reinforcement Learning and Adaptive Critic Methods. Van Nostrand Reinhold, New York. D. A. White and D. A. Sofge, eds.

Barto, A. G., Sutton, R. S., and Anderson, C. W. (1983). Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Trans. on Systems, Man, and Cybernetics, SMC-13(8):834–846.

Battiti, R. (1992). First- and second-order methods for learning: Between steepest descent and Newton's method. Neural Computation, 4:141–166.

Becker, S. (1996).
Mutual information maximization: models of cortical self-organization. Network: Computation in Neural Systems, 7:7–31.

Becker, S. and Hinton, G. E. (1992). Self-organizing neural network that discovers surfaces in random-dot stereograms. Nature, 355(9):161–163.

Becker, S. and Hinton, G. E. (1993). Learning mixture models of spatial coherence. Neural Computation, 5(2):267–277.

Bell, A. J. and Sejnowski, T. J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7:1129–1159.

Bellman, R. E. (1957). Dynamic Programming. Princeton University Press, Princeton, NJ.

Bloom, F. E. and Lazerson, A. (1985). Brain, Mind, and Behavior. W. H. Freeman and Company.

Bock, R. D. (1975). Multivariate Statistical Methods in Behavioral Research. McGraw-Hill series in psychology. McGraw-Hill.

Borga, M. (1993). Hierarchical reinforcement learning. In Gielen, S. and Kappen, B., editors, ICANN'93, Amsterdam. Springer-Verlag.

Borga, M. (1995). Reinforcement Learning Using Local Adaptive Models. Thesis No. 507, ISBN 91–7871–590–3.

Borga, M. and Knutsson, H. (1998). An adaptive stereo algorithm based on canonical correlation analysis. Submitted to ICIPS'98.

Borga, M., Knutsson, H., and Landelius, T. (1997a). Learning canonical correlations. In Proceedings of the 10th Scandinavian Conference on Image Analysis, Lappeenranta, Finland. SCIA.

Borga, M., Landelius, T., and Knutsson, H. (1997b). A unified approach to PCA, PLS, MLR and CCA. Information Sciences. Submitted; revised for second review.

Bower, G. H. and Hilgard, E. R. (1981). Theories of Learning. Prentice–Hall, Englewood Cliffs, NJ, fifth edition.

Bracewell, R. (1986). The Fourier Transform and its Applications. McGraw-Hill, second edition.

Bradtke, S. J. (1993). Reinforcement learning applied to linear quadratic regulation. In Advances in Neural Information Processing Systems 5, San Mateo, CA. Morgan Kaufmann.

Bregler, C. and Omohundro, S. M. (1994).
Surface learning with applications to lipreading. In Advances in Neural Information Processing Systems 6, pages 43–50, San Francisco. Morgan Kaufmann.

Brooks, V. B. (1986). The Neural Basis of Motor Control. Oxford University Press.

Broomhead, D. S. and Lowe, D. (1988). Multivariable functional interpolation and adaptive networks. Complex Systems, 2:321–355.

Carson, J. and Fry, T. (1937). Variable frequency electric circuit theory with application to the theory of frequency modulation. Bell System Tech. J., 16:513–540.

Comon, P. (1994). Independent component analysis, a new concept? Signal Processing, 36(3):287–314.

Coren, S. and Ward, L. M. (1989). Sensation & Perception. Harcourt Brace Jovanovich, San Diego, USA, third edition. ISBN 0–15–579647–X.

Das, S. and Sen, P. K. (1994). Restricted canonical correlations. Linear Algebra and its Applications, 210:29–47.

Davis, L., editor (1987). Genetic Algorithms and Simulated Annealing. Pitman, London.

Denoeux, T. and Lengellé, R. (1993). Initializing back propagation networks with prototypes. Neural Networks, 6(3):351–363.

Derin, H. and Kelly, P. A. (1989). Discrete-index Markov-type random processes. In Proceedings of the IEEE, volume 77.

Duda, R. O. and Hart, P. E. (1973). Pattern Classification and Scene Analysis. Wiley-Interscience, New York.

Fieguth, P. W., Irving, W. W., and Willsky, A. S. (1995). Multiresolution model development for overlapping trees via canonical correlation analysis. In International Conference on Image Processing, pages 45–48, Washington DC. IEEE.

Field, D. J. (1994). What is the goal of sensory coding? Neural Computation. In press.

Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Ann. Eugenics, 7(Part II):179–188. Also in Contributions to Mathematical Statistics (John Wiley, New York, 1950).

Földiák, P. (1990). Forming sparse representations by local anti-Hebbian learning. Biological Cybernetics.

Fletcher, R. and Reeves, C. M. (1964).
Function minimization by conjugate gradients. Computer Journal, 7:149–154.

Geladi, P. and Kowalski, B. R. (1986). Partial least-squares regression: a tutorial. Analytica Chimica Acta, 185:1–17.

Geman, S., Bienenstock, E., and Doursat, R. (1992). Neural networks and the bias/variance dilemma. Neural Computation, 4:1–58.

Giles, G. L. and Maxwell, T. (1987). Learning, invariance, and generalization in high-order neural networks. Applied Optics, 26(23):4972–4978.

Goldberg, D. E. (1989). Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley.

Golub, G. H. and Van Loan, C. F. (1989). Matrix Computations. The Johns Hopkins University Press, second edition.

Granlund, G. H. (1978). In search of a general picture processing operator. Computer Graphics and Image Processing, 8(2):155–178.

Granlund, G. H. (1988). Integrated analysis-response structures for robotics systems. Report LiTH–ISY–I–0932, Computer Vision Laboratory, Linköping University, Sweden.

Granlund, G. H. (1989). Magnitude representation of features in image analysis. In The 6th Scandinavian Conference on Image Analysis, pages 212–219, Oulu, Finland.

Granlund, G. H. (1997). From multidimensional signals to the generation of responses. In Sommer, G. and Koenderink, J. J., editors, Algebraic Frames for the Perception-Action Cycle, volume 1315 of Lecture Notes in Computer Science, pages 29–53, Kiel, Germany. Springer-Verlag. International Workshop, AFPAC'97, invited paper.

Granlund, G. H. and Knutsson, H. (1982). Hierarchical processing of structural information in artificial intelligence. In Proceedings of 1982 IEEE Conference on Acoustics, Speech and Signal Processing, Paris. IEEE. Invited paper.

Granlund, G. H. and Knutsson, H. (1983). Contrast of structured and homogenous representations. In Braddick, O. J. and Sleigh, A. C., editors, Physical and Biological Processing of Images, pages 282–303. Springer-Verlag, Berlin.

Granlund, G. H. and Knutsson, H. (1990).
Compact associative representation of visual information. In Proceedings of The 10th International Conference on Pattern Recognition. Report LiTH–ISY–I–1091, Linköping University, Sweden, 1990. Granlund, G. H. and Knutsson, H. (1995). Signal Processing for Computer Vision. Kluwer Academic Publishers. ISBN 0-7923-9530-1. Gray, R. M. (1984). Vector quaantization. IEEE ASSP Magazine, 1:4–29. Gray, R. M. (1990). Entropy and Information Theory. Springer-Verlag, New York. Gullapalli, V. (1990). A stochastic reinforcement learning algorithm for learning real-valued functions. Neural Networks, 3:671–692. Haykin, S. (1994). Neural Networks: A Comprehensive Foundation. Macmillan College Publishing Company. Hebb, D. O. (1949). The Organization of Behavior. Wiley, New York. Heger, M. (1994). Consideration of risk in reinforcent learning. In Cohen, W. W. and Hirsh, H., editors, Proceedings of the 11th International Conference on Machine Learning, pages 105–111, Brunswick, NJ. Held, R. and Bossom, J. (1961). Neonatal deprivation and adult rearrangement. Complementary techniques for analyzing plastic sensory–motor coordinations. Journal of Comparative and Physiological Psychology, pages 33–37. Hertz, J., Krogh, A., and Palmer, R. G. (1991). Introduction to the Theory of Neural Computation. Addison-Wesley. 172 Bibliography Hinton, G. E. and Nowlan, S. J. (1987). How learning can guide evolution. Complex Systems, pages 495–502. Hinton, G. E. and Sejnowski, T. J. (1983). Optimal perceptual inference. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 448–453, Washington DC. Hinton, G. E. and Sejnowski, T. J. (1986). Learning and relearning in Boltzmann machines. In Rummelhart, D. E. and McClelland, J. L., editors, Parallel Distributed Processing: Explorations in Microstructures of Cognition. MIT Press, Cambridge, MA. Holland, J. H. (1975). Adaptation in Natural and Artificial Systems. University of Michigan Press, Ann Arbor. 
Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79:2554–2558.
Hornby, A. S. (1989). Oxford Advanced Learner’s Dictionary of Current English. Oxford University Press, Oxford, fourth edition. A. P. Cowie (ed.).
Hotelling, H. (1933). Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24:417–441, 498–520.
Hotelling, H. (1936). Relations between two sets of variates. Biometrika, 28:321–377.
Höskuldsson, A. (1988). PLS regression methods. Journal of Chemometrics, 2:211–228.
Hubel, D. H. (1988). Eye, Brain and Vision, volume 22 of Scientific American Library. W. H. Freeman and Company. ISBN 0-7167-5020-1.
Hubel, D. H. and Wiesel, T. N. (1959). Receptive fields of single neurones in the cat’s striate cortex. J. Physiol., 148:574–591.
Hubel, D. H. and Wiesel, T. N. (1962). Receptive fields, binocular interaction and functional architecture in the cat’s striate cortex. J. Physiol., 160:106–154.
Izenman, A. J. (1975). Reduced-rank regression for the multivariate linear model. Journal of Multivariate Analysis, 5:248–264.
Jaakkola, T., Jordan, M. I., and Singh, S. P. (1994). On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6:1185–1201.
Jacobs, R. A. (1988). Increased rates of convergence through learning rate adaptation. Neural Networks, 1:295–307.
Jacobs, R. A., Jordan, M. I., Nowlan, S. J., and Hinton, G. E. (1991). Adaptive mixtures of local experts. Neural Computation, 3:79–87.
Jepson, A. D. and Fleet, D. J. (1990). Scale-space singularities. In Faugeras, O., editor, Computer Vision - ECCV 90, pages 50–55. Springer-Verlag.
Johansson, B. (1997). Multidimensional signal recognition, invariant to affine transformation and time-shift, using canonical correlation. Master’s thesis, Linköpings universitet. LiTH-ISY-EX-1825.
Jolliffe, I. T. (1986). Principal Component Analysis. Springer-Verlag, New York.
Jordan, M. I. and Jacobs, R. A. (1994). Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6(2):181–214.
Kailath, T. (1980). Linear Systems. Information and System Sciences Series. Prentice-Hall, Englewood Cliffs, N.J.
Karhunen, K. (1947). Über lineare Methoden in der Wahrscheinlichkeitsrechnung. Annales Academiae Scientiarum Fennicae, Series A1: Mathematica-Physica, 37:3–79.
Kay, J. (1992). Feature discovery under contextual supervision using mutual information. In International Joint Conference on Neural Networks, volume 4, pages 79–84. IEEE.
Knutsson, H. (1982). Filtering and Reconstruction in Image Processing. PhD thesis, Linköping University, Sweden. Diss. No. 88.
Knutsson, H. (1985). Producing a continuous and distance preserving 5-D vector representation of 3-D orientation. In IEEE Computer Society Workshop on Computer Architecture for Pattern Analysis and Image Database Management - CAPAIDM, pages 175–182, Miami Beach, Florida. IEEE. Report LiTH-ISY-I-0843, Linköping University, Sweden, 1986.
Knutsson, H. (1989). Representing local structure using tensors. In The 6th Scandinavian Conference on Image Analysis, pages 244–251, Oulu, Finland. Report LiTH-ISY-I-1019, Computer Vision Laboratory, Linköping University, Sweden, 1989.
Knutsson, H., Borga, M., and Landelius, T. (1995). Learning Canonical Correlations. Report LiTH-ISY-R-1761, Computer Vision Laboratory, S-581 83 Linköping, Sweden.
Kohonen, T. (1972). Correlation matrix memories. IEEE Trans. on Computers, C-21:353–359.
Kohonen, T. (1982). Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43:59–69.
Kohonen, T. (1989). Self-organization and Associative Memory. Springer-Verlag, Berlin, third edition.
Landelius, T. (1993). Behavior Representation by Growing a Learning Tree. Thesis No. 397, ISBN 91-7871-166-5.
Landelius, T. (1997). Reinforcement Learning and Distributed Local Model Synthesis. PhD thesis, Linköping University, Sweden, S-581 83 Linköping, Sweden. Dissertation No 469, ISBN 91-7871-892-9.
Landelius, T., Borga, M., and Knutsson, H. (1996). Reinforcement Learning Trees. Report LiTH-ISY-R-1828, Computer Vision Laboratory, S-581 83 Linköping, Sweden.
Landelius, T., Knutsson, H., and Borga, M. (1995). On-Line Singular Value Decomposition of Stochastic Process Covariances. Report LiTH-ISY-R-1762, Computer Vision Laboratory, S-581 83 Linköping, Sweden.
Lapointe, F. J. and Legendre, P. (1994). A classification of pure malt scotch whiskies. Applied Statistics, 43(1):237–257.
Lee, C. C. and Berenji, H. R. (1989). An intelligent controller based on approximate reasoning and reinforcement learning. Proceedings of the IEEE Int. Symposium on Intelligent Control, pages 200–205.
Li, P., Sun, J., and Yu, B. (1997). Direction finding using interpolated arrays in unknown noise fields. Signal Processing, 58:319–325.
Linsker, R. (1988). Self-organization in a perceptual network. Computer, 21(3):105–117.
Linsker, R. (1989). How to generate ordered maps by maximizing the mutual information between input and output signals. Neural Computation, 1:402–411.
Ljung, L. (1987). System Identification. Prentice-Hall.
Loève, M. (1963). Probability Theory. Van Nostrand, New York.
Luenberger, D. G. (1969). Optimization by Vector Space Methods. Wiley, New York.
Marr, D. (1982). Vision. W. H. Freeman and Company, New York.
McCulloch, W. S. and Pitts, W. (1943). A logical calculus of ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5:115–133.
Mikaelian, G. and Held, R. (1964). Two types of adaptation to an optically–rotated visual field. American Journal of Psychology, 77:257–263.
Minsky, M. L. (1961). Steps towards artificial intelligence. In Proceedings of the Institute of Radio Engineers, volume 49, pages 8–30.
Minsky, M. L. (1963).
Computers and Thought, chapter Steps Towards Artificial Intelligence, pages 406–450. McGraw-Hill. E. A. Feigenbaum and J. Feldman, Eds.
Minsky, M. L. and Papert, S. (1969). Perceptrons. M.I.T. Press, Cambridge, Mass.
Montanarella, L., Bassani, M. R., and Breas, O. (1995). Chemometric classification of some European wines using pyrolysis mass spectrometry. Rapid Communications in Mass Spectrometry, 9(15):1589–1593.
Moody, J. and Darken, C. J. (1989). Fast learning in networks of locally-tuned processing units. Neural Computation, 1:281–293.
Munro, P. (1987). A dual back-propagation scheme for scalar reward learning. In Proceedings of the 9th Annual Conf. of the Cognitive Science Society, pages 165–176, Seattle, WA.
Narendra, K. S. and Thathachar, M. A. L. (1974). Learning automata - a survey. IEEE Trans. on Systems, Man, and Cybernetics, 4(4):323–334.
Nordberg, K., Granlund, G., and Knutsson, H. (1994). Representation and Learning of Invariance. Report LiTH-ISY-I-1552, Computer Vision Laboratory, S-581 83 Linköping, Sweden.
Oja, E. (1982). A simplified neuron model as a principal component analyzer. J. Math. Biology, 15:267–273.
Oja, E. (1989). Neural networks, principal components, and subspaces. International Journal of Neural Systems, 1:61–68.
Oja, E. and Karhunen, J. (1985). On stochastic approximation of the eigenvectors and eigenvalues of the expectation of a random matrix. Journal of Mathematical Analysis and Applications, 106:69–84.
Olds, J. and Milner, P. (1954). Positive reinforcement produced by electrical stimulation of septal area and other regions of rat brain. J. comp. physiol. psychol., 47:419–427.
Pavlov, I. P. (1955). Selected Works. Foreign Languages Publishing House, Moscow.
Pearlmutter, B. A. and Hinton, G. E. (1986). G-maximization: An unsupervised learning procedure for discovering regularities. In Neural Networks for Computing: American Institute of Physics Conference Proceedings, volume 151, pages 333–338.
Pearson, K. (1896). Mathematical contributions to the theory of evolution–III. Regression, heredity and panmixia. Philosophical Transactions of the Royal Society of London, Series A, 187:253–318.
Pearson, K. (1901). On lines and planes of closest fit to systems of points in space. Philosophical Magazine, 2:559–572.
Pollen, D. A. and Ronner, S. F. (1983). Visual cortical neurons as localized spatial frequency filters. IEEE Trans. on Syst. Man Cybern., 13(5):907–915.
Riedmiller, M. and Braun, H. (1993). A direct adaptive method for faster backpropagation learning: The RPROP algorithm. In Proceedings of the IEEE International Conference on Neural Networks, San Francisco, CA.
Ritter, H. (1991). Asymptotic level density for a class of vector quantization processes. IEEE Transactions on Neural Networks, 2:173–175.
Ritter, H., Martinetz, T., and Schulten, K. (1989). Topology conserving maps for learning visuomotor-coordination. Neural Networks, 2:159–168.
Ritter, H., Martinetz, T., and Schulten, K. (1992). Neural Computation and Self-Organizing Maps. Addison-Wesley.
Rosenblatt, F. (1962). Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms. Spartan Books, Washington, D.C.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323:533–536.
Samuel, A. L. (1959). Some studies in machine learning using the game of checkers. IBM J. Res. Develop., 3(3):210–229.
Sanger, T. D. (1988). Stereo disparity computation using Gabor filters. Biological Cybernetics, 59:405–418.
Sanger, T. D. (1989). Optimal unsupervised learning in a single-layer feedforward neural network. Neural Networks, 2:459–473.
Schultz, W., Dayan, P., and Montague, P. R. (1997). A neural substrate of prediction and reward. Science, 275:1593–1599.
Shannon, C. E. (1948). A mathematical theory of communication. The Bell System Technical Journal. Also in N. J. A. Sloane and A. D. Wyner (ed.)
Claude Elwood Shannon Collected Papers, IEEE Press 1993.
Skinner, B. F. (1938). The Behavior of Organisms: An Experimental Analysis. Prentice-Hall, Englewood Cliffs, N.J.
Smith, R. E. and Goldberg, D. E. (1990). Reinforcement learning with classifier systems. Proceedings, AI, Simulation and Planning in High Autonomy Systems, 6:284–192.
Steinbuch, K. and Piske, U. A. W. (1963). Learning matrices and their applications. IEEE Transactions on Electronic Computers, 12:846–862.
Stewart, D. K. and Love, W. A. (1968). A general canonical correlation index. Psychological Bulletin, 70:160–163.
Stewart, G. W. (1976). A bibliographical tour of the large, sparse generalized eigenvalue problem. In Bunch, J. R. and Rose, D. J., editors, Sparse Matrix Computations, pages 113–130.
Sutton, R. S. (1984). Temporal Credit Assignment in Reinforcement Learning. PhD thesis, University of Massachusetts, Amherst, MA.
Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3:9–44.
Tesauro, G. (1990). Neurogammon: a neural network backgammon playing program. In IJCNN Proceedings III, pages 33–39.
Thorndike, E. L. (1898). Animal intelligence: An experimental study of the associative processes in animals. Psychological Review, 2(8). Monogr. Suppl.
Torres, L. and Kunt, M., editors (1996). Video Coding: The Second Generation Approach. Kluwer Academic Publishers.
van den Wollenberg, A. L. (1977). Redundancy analysis: An alternative for canonical correlation analysis. Psychometrika, 36:207–209.
van der Burg, E. (1988). Nonlinear Canonical Correlation and Some Related Techniques. DSWO Press.
van der Pol, B. (1946). The fundamental principles of frequency modulation. Proceedings of the IEEE, 93:153–158.
Watkins, C. (1989). Learning from Delayed Rewards. PhD thesis, Cambridge University.
Werbos, P. J. (1992). Handbook of Intelligent Control, chapter Approximate dynamic programming for real-time control and neural modelling. Van Nostrand Reinhold. D. A. White and D. A. Sofge, Eds.
Werbos, P. J. (1974). Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. PhD thesis, Harvard University.
Werbos, P. J. (1990). Consistency of HDP applied to a simple reinforcement learning problem. Neural Networks, 3:179–189.
Westelius, C.-J. (1995). Focus of Attention and Gaze Control for Robot Vision. PhD thesis, Linköping University, Sweden, S-581 83 Linköping, Sweden. Dissertation No 379, ISBN 91-7871-530-X.
Whitehead, S. D. and Ballard, D. H. (1990a). Active perception and reinforcement learning. Proceedings of the 7th Int. Conf. on Machine Learning, pages 179–188.
Whitehead, S. D. and Ballard, D. H. (1990b). Learning to perceive and act. Technical report, Computer Science Department, University of Rochester.
Whitehead, S. D., Sutton, R. S., and Ballard, D. H. (1990). Advances in reinforcement learning and their implications for intelligent control. Proceedings of the 5th IEEE Int. Symposium on Intelligent Control, 2:1289–1297.
Williams, R. J. (1988). On the use of backpropagation in associative reinforcement learning. In IEEE Int. Conf. on Neural Networks, pages 263–270.
Wilson, R. and Knutsson, H. (1989). A multiresolution stereopsis algorithm based on the Gabor representation. In 3rd International Conference on Image Processing and Its Applications, pages 19–22, Warwick, Great Britain. IEE. ISBN 0 85296382 3 ISSN 0537-9989.
Wold, S., Ruhe, A., Wold, H., and Dunn, W. J. (1984). The collinearity problem in linear regression. The partial least squares (PLS) approach to generalized inverses. SIAM J. Sci. Stat. Comput., 5(3):735–743.
Zadeh, L. A. (1968). Fuzzy algorithms. Information and Control, 12:94–102.
Zadeh, L. A. (1988). Fuzzy logic. Computer, pages 83–93.

Author index
Anderson, C. W., 16, 20–22
Anderson, J. A., 48
Anderson, T. W., 25
Baird, L. C., 20
Baker, W. L., 52
Ballard, D. H., 19, 33, 35, 37, 38, 40
Barlow, H., 31
Barto, A. G., 16, 17, 19–22
Bassani, M. R., 69
Battiti, R., 11
Becker, S., 31, 69, 105, 108
Bell, A. J., 31
Bellman, R. E., 18
Berenji, H. R., 38
Bernard, S. T., 105
Bienenstock, E., 38
Bloom, F. E., 13
Bock, R. D., 60
Borga, M., 3, 53, 60, 61, 69, 107, 121
Bossom, J., 8, 65
Bower, G. H., 8
Bracewell, R. N., 99, 103
Bradtke, S. J., 20
Braun, H., 12, 147
Breas, O., 69
Bregler, C., 52
Brooks, V. B., 8, 65
Broomhead, D. S., 45
van der Burg, E., 33
Carson, J., 103
Comon, P., 70
Coren, S., 131
Darken, C. J., 45
Das, S., 69
Davis, L., 23
Dayan, P., 13, 20
Denoeux, T., 41
Derin, H., 17
Doursat, R., 38
Duda, R. O., 62
Dunn, W. J., 60, 67
Farrell, J. A., 52
Fischler, M. A., 105
Fieguth, P. W., 69
Field, D. J., 42
Fisher, R. A., 62
Fleet, D. J., 105
Fletcher, R., 11
Fry, T., 103
Földiák, P., 31
Geladi, P., 67
Geman, S., 38
Giles, C. L., 109
Goldberg, D. E., 23, 37
Golub, G. H., 60
Granlund, G. H., 38–40, 51, 52, 99, 101, 103, 109
Gray, R. M., 26, 30
Gullapalli, V., 16, 21
Hart, P. E., 62
Haykin, S., 11, 24, 30
Hebb, D. O., 24, 48
Heger, M., 19
Held, R., 8, 65
Hertz, J., 24, 26, 38, 39
Hilgard, E. R., 8
Hinton, G. E., 23, 28, 31, 36, 45, 105, 108
Holland, J. H., 23
Hopfield, J. J., 45
Hornby, A. S., 7
Hotelling, H., 60, 64, 69
Hubel, D. H., 26, 40
Höskuldsson, A., 60, 67
Irving, W. W., 69
Izenman, A. J., 60
Jaakkola, T., 20
Jacobs, R. A., 12, 28, 36, 38
Jepson, A. D., 105
Johansson, B., 51
Jolliffe, I. T., 65
Jordan, M. I., 20, 28, 36, 38
Kailath, T., 162
Karhunen, J., 82
Karhunen, K., 64
Kaushal, T. P., 31
Kay, J., 69, 70, 85
Kelly, P. A., 17
Knutsson, H., 3, 38–41, 51–53, 60, 61, 69, 99, 101, 103, 105, 107, 109, 121
Kohonen, T., 26, 27, 48, 49, 53
Kowalski, B. R., 67
Krogh, A., 24, 26, 38, 39
Kunt, M., 65
Landelius, T., 3, 9, 20, 51–53, 60, 61, 69, 107, 146
Lapointe, F. J., 69
Lazerson, A., 13
Lee, C. C., 38
Lengellé, R., 41
Legendre, P., 69
Li, P., 69
Linsker, R., 30, 31
Ljung, L., 49
Loève, M., 64
Love, W. A., 75
Lowe, D., 45
Luenberger, D. G., 11
Marr, D., 105
Martinetz, T., 53
Maxwell, T., 109
McCulloch, W. S., 44
Mikaelian, G., 8, 65
Milner, P., 13
Minsky, M. L., 35, 45, 46
Mitchison, G. J., 31
Montague, P. R., 13, 20
Montanarella, L., 69
Moody, J., 45
Munro, P., 9, 16
Narendra, K. S., 7
Nordberg, K., 39, 109
Nowlan, S. J., 23, 28, 36
Oja, E., 24–26, 82
Olds, J., 13
Omohundro, S. M., 52
Palmer, R. G., 24, 26, 38, 39
Papert, S., 45, 46
Pavlov, I. P., 7
Pearlmutter, B. A., 31
Pearson, K., 25, 64
Piske, U. A. W., 48
Pitts, W., 44
van der Pol, B., 103
Pollen, D. A., 1
Reeves, C. M., 11
Riedmiller, M., 12, 147
Ritter, H., 27, 53
Ronner, S. F., 1
Rosenblatt, F., 45
Ruhe, A., 60, 67
Rumelhart, D. E., 36, 45
Samuel, A. L., 19
Sanger, T. D., 25, 105
Schulten, K., 53
Sejnowski, T. J., 31, 45
Sen, P. K., 69
Shannon, C. E., 28, 29
Singh, S. P., 20
Skinner, B. F., 8
Smith, R. E., 37
Steinbuch, K., 48
Stewart, D. K., 60, 75
Sun, J., 69
Sutton, R. S., 16, 19–22, 36, 55
Tesauro, G., 14
Thathachar, M. A. L., 7
Thorndike, E. L., 7
Torres, L., 65
Van Loan, C. F., 60
Ward, L. M., 131
Watkins, C., 9, 19, 20
Werbos, P. J., 19, 20, 45
Westelius, C.-J., 105
Whitehead, S. D., 19, 33, 35, 37
Wiesel, T. N., 26, 40
Williams, R. J., 16, 32, 36, 45
Willsky, A. S., 69
Wilson, R., 105
Wold, H., 60, 67
Wold, S., 60, 67
Wolfram, S., 13, 20
van den Wollenberg, A. L., 60
Yu, B., 69
Zadeh, L. A., 37, 38
