A Novel Transfer Function for Continuous Interpolation between Summation and Multiplication in Neural Networks

Kontinuerlig interpolering mellan addition och multiplikation med hjälp av en lämplig överföringsfunktion i artificiella neuronnät

Degree Project in Computer Science, Second Level
Stockholm, Sweden 2015

WIEBKE KÖPP
[email protected]
KTH Royal Institute of Technology
School of Computer Science and Communication (CSC)
Master's Thesis in Machine Learning
February 17, 2016

Supervisor: Erik Fransén
Examiner: Anders Lansner
Project Provider: TUM BRML Lab
TUM BRML Supervisor: Patrick van der Smagt
TUM BRML Advisor: Sebastian Urban
Acknowledgments
This thesis, conducted at the biomimetic robotics and machine learning (BRML) group
at Technische Universität München (TUM), would not have been possible without the
support and encouragement of many people to whom I am greatly indebted.
First, I would like to express my gratitude to my supervisor Patrick van der Smagt for
giving me the opportunity to work on this very interesting topic and warmly welcoming
me into the BRML family.
I would also like to thank my advisor Sebastian Urban. His knowledge and feedback
have been crucial in advancing and completing this work.
A special thanks goes to my supervisors at KTH, Erik Fransén and Anders Lansner,
who not only provided me with their expertise, but also gave me a lot of freedom in
working on this project.
Furthermore, I am grateful for many insightful discussions with my colleagues at
BRML Max Sölch, Marvin Ludersdorfer and Justin Bayer.
Finally, I would like to thank my family and friends for their unconditional love and support.
Abstract
In this work, we present the implementation and evaluation of a novel parameterizable transfer function for use in artificial neural networks. It allows a continuous change between summation and multiplication for the operation performed by a neuron. The transfer function is based on continuously differentiable fractional iterates of the exponential function and introduces one additional parameter per neuron and layer. This parameter can be determined along with weights and biases during standard, gradient-based training. We evaluate the proposed transfer function within neural networks by comparing its performance to conventional transfer functions on various regression problems. Interpolation between summation and multiplication achieves comparable or even slightly better results, outperforming conventional transfer functions on a task involving missing data and multiplicative interactions between inputs.
Referat (Swedish abstract)
In this work, we present the implementation and evaluation of a new transfer function for use in artificial neural networks. It allows a continuous change between summation and multiplication for the operation performed by a neuron. The transfer function is based on continuously differentiable fractional iterates of the exponential function and introduces one additional parameter per neuron and layer. This parameter can be determined along with weights and biases during standard gradient-based training. We evaluate the proposed transfer function within neural networks by comparing its performance with conventional transfer functions on various regression problems. Interpolation between summation and multiplication achieves comparable or slightly better results, for example on a task involving missing data and multiplicative interactions between inputs.
Contents

1. Introduction
2. Background
   2.1. Feed-Forward Artificial Neural Networks
   2.2. Mathematical Concepts
        2.2.1. Differentiation of Real- and Complex-Valued Functions
        2.2.2. Linear Interpolation and Extrapolation
   2.3. Software Components
        2.3.1. CUDA
        2.3.2. Theano
   2.4. Further Related Work
3. Fractional Iterates of the Exponential Function
   3.1. Abel’s Functional Equation
        3.1.1. Continuous Differentiability of the Real-Valued Fractional Exponential
        3.1.2. Derivatives Involved in the Real-Valued Fractional Exponential
   3.2. Schröder’s Functional Equation
        3.2.1. Derivatives Involved in the Complex-Valued Fractional Exponential
   3.3. Architecture of a Neural Net using Fractional Iterates of the Exponential Function
4. Implementation Details
   4.1. Implementation of the Real- and Complex-Valued Fractional Exponential
   4.2. Integration in Theano
   4.3. Issues with Precision and Accuracy
        4.3.1. Error Estimation
        4.3.2. Numerical Verification
   4.4. Interpolation
5. Experimental Results
   5.1. Recognition of Handwritten Digits
   5.2. Multinomial Regression
        5.2.1. Generalization in Multinomial Regression
   5.3. Double Pendulum Regression
6. Conclusion and Future Work
List of Figures
List of Tables
Bibliography
Appendix A. Plot Techniques
   A.1. Domain Coloring
   A.2. Box-and-Whisker Plot
1. Introduction
Conventional artificial neural networks (ANNs) compute weighted sums over inputs in potentially multiple layers. Considering one layer in an ANN with multiple inputs x_j and multiple outputs y_i, the value of one output neuron y_i is computed as

    y_i = \sigma\Big( \sum_j W_{ij} x_j \Big) ,    (1.1)

where \sigma is a non-linear transfer function, e.g. \sigma(x) = 1/(1 + e^{-x}) or \sigma(x) = \tanh(x).
ANNs computing the operation above for all neurons will be referred to as additive
ANNs.
An alternative operation was first explored in Durbin and Rumelhart (1989), where inputs are instead multiplied and weights enter as powers of the inputs. Then y_i is given by y_i = \sigma\big( \prod_j x_j^{W_{ij}} \big). This expression can be reformulated as

    y_i = \sigma\Big[ \exp^{(1)}\Big( \sum_j W_{ij} \exp^{(-1)}(x_j) \Big) \Big] .    (1.2)
Here, f^{(n)} denotes the iterated application of an invertible function f : \mathbb{C} \to \mathbb{C}, so that f^{(1)} is the function f itself, f^{(0)}(z) = z and f^{(-1)} is the inverse of f. In general, the following two equations hold:

    f^{(n)}(z) = (\underbrace{f \circ f \circ \dots \circ f}_{n \text{ times}})(z) ,    (1.3)
    f^{(n)} \circ f^{(m)} = f^{(n+m)} , \quad n, m \in \mathbb{Z} .    (1.4)
Close examination of (1.2) shows that replacing the iteration arguments n = 1 and n = -1 of exp^{(n)} by 0 leads exactly to expression (1.1). Therefore, a function exp^{(n)} that is defined for all n \in \mathbb{R}, which includes all real values between -1 and 1, can be used to interpolate smoothly between the two settings. exp^{(n)} is called a fractional iterate of the exponential function. In order to be
used with ANNs trained by gradient-based optimization methods, such as stochastic
gradient descent, the function additionally needs to be continuously differentiable.
In this report, we show how fractional iterates of the exponential function can be used within a novel transfer function that enables interpolation between summation and multiplication for the operation of a neuron. The idea was first introduced in Urban and Smagt (2015); one contribution of this work is the implementation of two different approaches for fractional iteration of the exponential function and their integration into the widely used machine learning library Theano. We also address the trade-off between accuracy and computational efficiency in the computationally intense operations involved, both by error analysis and by interpolation of the involved functions. Furthermore, the proposed transfer function is evaluated by comparing its performance to conventional transfer functions on various regression problems. Here, we mainly focus on datasets that incorporate multiplicative interactions between inputs, as we expect these to benefit most from the novel transfer function.
The remainder of this report is structured as follows. First, a brief overview of artificial neural networks and the procedure used to train them is given. Additionally, this chapter covers related work and provides the necessary background for concepts applied in the implementation of the proposed transfer function. The following chapter serves as a summary of Urban and Smagt (2015), where the transfer function interpolating between summation and multiplication was first suggested. We discuss how a solution for exp^{(n)} can be derived and then show how this solution is used as a transfer function within ANNs. The proof of differentiability for the first alternative is an original contribution of this work. Experimental results obtained by applying the novel transfer function within ANNs are discussed in Ch. 5. Finally, in concluding this report, some pointers to future work are given.
2. Background
This chapter covers some concepts that are either necessary for understanding how the transfer function based on fractional iterates of the exponential function is derived or have been applied in its implementation and evaluation. Additionally included is a brief overview of the software components used, namely Theano and CUDA.
2.1. Feed-Forward Artificial Neural Networks
As universal function approximators (Hornik, Stinchcombe, and White, 1989), feedforward neural networks or multi-layer perceptrons have been a popular tool for solving
regression and classification problems. Here, we recall their basic elements. A more
detailed introduction can be found in Bishop (2013, Ch. 5). For an in-depth historical
review of research in the field, the reader is referred to Schmidhuber (2015).
Figure 2.1.: A two-layer feed-forward neural network with three input neurons, one hidden layer of four neurons and two outputs.
A multi-layer perceptron, as depicted in Fig. 2.1, consists of an input and an output
layer of neurons as well as an arbitrary number of hidden layers. The number of input
and output neurons depends on the dimensions of the data the network will be trained on. An ANN for approximation of a function f : X \to Y with X \subseteq \mathbb{R}^{d_x}, Y \subseteq \mathbb{R}^{d_y} will have an input layer with d_x and an output layer with d_y neurons. As mentioned in Ch. 1, in
conventional ANNs, the output of one layer is computed as the weighted sum of its
inputs. Often, a so-called bias is then added, before propagating the result through a
non-linear transfer function. The output vector y \in \mathbb{R}^{d_y}, where each entry contains the output of one neuron of the output layer, is then computed as

    y = \sigma_2 [ W_2 h + b_2 ]
      = \sigma_2 [ W_2 \sigma_1 ( W_1 x + b_1 ) + b_2 ]

for an ANN with one hidden layer. Here, the input vector is denoted by x \in \mathbb{R}^{d_x} and h represents the result vector of the hidden layer with a given number of entries d_h. W_1 \in \mathbb{R}^{d_h \times d_x} and W_2 \in \mathbb{R}^{d_y \times d_h} refer to the weights, while b_1 \in \mathbb{R}^{d_h} and b_2 \in \mathbb{R}^{d_y} constitute the biases for the first and second layer respectively. Recalling the graphical representation of the ANN in Fig. 2.1, one can think of each entry in the weight matrices as belonging to one of the edges. Finally, \sigma_1 and \sigma_2 are non-linear transfer functions. Together, W_1, W_2, b_1 and b_2 form the set of parameters that are optimized when training the two-layer ANN. The evaluation of the equations above is considered the forward propagation through the network.
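As a minimal illustration of the forward pass just described, the following NumPy sketch evaluates such a two-layer network for a single input vector; the layer sizes and the tanh/identity transfer functions are arbitrary choices for the example, not taken from the thesis.

    import numpy as np

    def forward(x, W1, b1, W2, b2, sigma1=np.tanh, sigma2=lambda a: a):
        # h = sigma1(W1 x + b1), y = sigma2(W2 h + b2)
        h = sigma1(W1 @ x + b1)
        return sigma2(W2 @ h + b2)

    rng = np.random.RandomState(0)
    d_x, d_h, d_y = 3, 4, 2                       # sizes as in Fig. 2.1
    W1, b1 = rng.randn(d_h, d_x), np.zeros(d_h)
    W2, b2 = rng.randn(d_y, d_h), np.zeros(d_y)
    y = forward(rng.randn(d_x), W1, b1, W2, b2)   # output vector of length d_y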
Without a non-linear transfer or activation function, an ANN would only be able to model linear combinations of its inputs, which occur only rarely in real-world datasets. The most commonly used transfer functions (see Fig. 2.2) are the sigmoid, tanh and rectifier functions. Neurons or units with the latter as their transfer function are also called rectified linear units (ReLUs).
Neural networks are trained through back-propagation (Rumelhart, G. E. Hinton, and Williams, 1986). This procedure involves the computation of an error function that gives an estimate of how well the network is able to approximate given training data for the current parameter configuration. Alternative terms for the error function are loss or cost. Assuming a differentiable cost C, the gradient of that cost with respect to each individual parameter of the network can be computed by iterated application of the chain rule of differential calculus. This approach is very similar to reverse-mode automatic differentiation on expression graphs, which is detailed in Sec. 2.3.2. Cost C can thus be minimized with first-order gradient-based optimization methods, the simplest of which is gradient descent (GD). During learning, an individual weight is updated following

    w \leftarrow w - \alpha \frac{\partial C}{\partial w} ,    (2.1)

where \alpha represents the learning rate controlling the speed and nature of learning.
Figure 2.2.: Commonly used non-linear transfer functions for ANNs: Sigmoid (\sigma_1(x) = 1/(1 + e^{-x})), Tanh (\sigma_2(x) = \tanh(x)) and Rectifier (\sigma_3(x) = \max(x, 0)).
An example of a loss function often used in regression problems is the mean-squared error (MSE),

    C_{\text{MSE}} = \frac{1}{n} \sum_{i=1}^{n} ( y_i - t_i )^2 ,    (2.2)

with y_i denoting the output of the ANN for training sample i and t_i containing the target values for that training sample.
Typically, in order to accelerate learning, not the entire training set is used for computing weight updates, but a so-called mini-batch or even just a single data point. This optimization approach is called stochastic gradient descent. Over the years, many extensions to gradient-based optimization in the context of ANN training have been introduced. Among these are the use of momentum (Rumelhart, G. E. Hinton, and Williams, 1986), which incorporates the parameter update of the previous step, and entirely adaptive optimizers such as Adam (Kingma and Ba, 2015), RMSProp (Tieleman and G. Hinton, 2012) or Adadelta (Zeiler, 2012). These adaptive optimizers have in common that they change the effective learning rate over the course of training and use past gradients or gradient magnitudes in order to accommodate the different nature of individual parameters.
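A minimal sketch of such an update rule, here plain SGD with classical momentum applied to a single parameter array (the function names and the gradient callback are illustrative, not part of the thesis code):

    import numpy as np

    def sgd_momentum_step(w, v, grad_fn, alpha=0.01, mu=0.9):
        # v accumulates a decaying sum of past gradients (momentum);
        # w is then moved against the gradient direction, cf. (2.1)
        g = grad_fn(w)
        v = mu * v - alpha * g
        return w + v, v

    # usage on a toy quadratic cost C(w) = ||w||^2 / 2, whose gradient is w
    w, v = np.ones(3), np.zeros(3)
    for _ in range(100):
        w, v = sgd_momentum_step(w, v, grad_fn=lambda w: w)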
2.2. Mathematical Concepts
2.2.1. Differentiation of Real- and Complex-Valued Functions
The derivative of a function f : \mathbb{R} \to \mathbb{R} at a point x_0 in its domain is defined as

    f'(x_0) = \lim_{x \to x_0} \frac{ f(x) - f(x_0) }{ x - x_0 } .    (2.3)

f is differentiable if the limit above exists for all points x_0 in its domain (Freitag and Busam, 2009).
In complex analysis, the derivative of a function f : \mathbb{C} \to \mathbb{C} is defined in the same way as in real analysis,

    f'(z_0) = \lim_{z \to z_0} \frac{ f(z) - f(z_0) }{ z - z_0 } = \lim_{h \to 0} \frac{ f(z_0 + h) - f(z_0) }{ h } .    (2.4)

A function f is said to be complex differentiable or analytic if these limits exist for all z_0. Other concepts from real differentiability, such as the product and quotient rules, apply as well. However, z and h are complex values and therefore z_0 cannot only be approached from the left and right, but from any direction.
For a function f(z) = f(x + iy) = u(x, y) + i v(x, y), we derive

    \lim_{h \to 0} \frac{ f((x + h) + iy) - f(x + iy) }{ h }
      = \lim_{h \to 0} \frac{ u(x + h, y) - u(x, y) }{ h } + i \frac{ v(x + h, y) - v(x, y) }{ h }
      = \frac{\partial u}{\partial x} + i \frac{\partial v}{\partial x}

for the limit along the real axis and

    \lim_{h \to 0} \frac{ f(x + i(y + h)) - f(x + iy) }{ ih }
      = \lim_{h \to 0} \frac{ u(x, y + h) - u(x, y) }{ ih } + i \frac{ v(x, y + h) - v(x, y) }{ ih }
      = -i \frac{\partial u}{\partial y} + \frac{\partial v}{\partial y}

for the limit along the imaginary axis.
Figure 2.3.: Linear interpolation and extrapolation in one dimension. (a) Linear interpolation of y \approx f(x) for a value x based on the two closest known points. (b) Linear extrapolation of y \approx f(x) for a value x based on the two closest known points; in this example x_1 < x_0 < x holds. Obviously, there exists a second case in one-dimensional extrapolation, where x < x_0 < x_1.
The limits above must be equal in order for f to be complex differentiable. By equating coefficients, we derive that the following equations, also known as the Cauchy-Riemann equations, must hold for complex differentiable functions:

    \frac{\partial u}{\partial x} = \frac{\partial v}{\partial y}  \quad and \quad  \frac{\partial u}{\partial y} = -\frac{\partial v}{\partial x} .    (2.5)
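As a concrete example, consider f(z) = \exp(z) = e^x \cos(y) + i\, e^x \sin(y), i.e. u(x, y) = e^x \cos(y) and v(x, y) = e^x \sin(y). Then \partial u / \partial x = e^x \cos(y) = \partial v / \partial y and \partial u / \partial y = -e^x \sin(y) = -\partial v / \partial x, so (2.5) is satisfied everywhere and the exponential function is analytic on all of \mathbb{C}.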
2.2.2. Linear Interpolation and Extrapolation
A widely-used approach for reducing the processing time of computationally intense functions are look-up tables (Tang, 1991), where a fixed number of function values f(x) = y are precomputed, either at program startup or stored in and then read from a file. Values for inputs x not included in the table are approximated linearly. With known input-value pairs (x_0, y_0) and (x_1, y_1), the linear approximation of f(x) with x_0 < x < x_1 is computed as

    f(x) \approx \frac{ x_1 - x }{ x_1 - x_0 } f(x_0) + \frac{ x - x_0 }{ x_1 - x_0 } f(x_1) .    (2.6)

An illustration of this principle can be seen in Fig. 2.3a.
By first interpolating in one dimension and then applying the same approach to consecutive dimensions, (2.6) can be generalized to multiple dimensions (Cheney and Kincaid, 2012). Bilinear interpolation with known function values at points p_{00} = (x_0, y_0), p_{01} = (x_0, y_1), p_{10} = (x_1, y_0) and p_{11} = (x_1, y_1), and an unknown value at p = (x, y) falling inside the rectangle spanned by these four points, is computed in two steps. First, linear interpolation in the x-direction yields

    f(x, y_0) \approx \frac{ x_1 - x }{ x_1 - x_0 } f(p_{00}) + \frac{ x - x_0 }{ x_1 - x_0 } f(p_{10})
    f(x, y_1) \approx \frac{ x_1 - x }{ x_1 - x_0 } f(p_{01}) + \frac{ x - x_0 }{ x_1 - x_0 } f(p_{11}) .    (2.7)

Linear interpolation between the values computed above results in the final approximation

    f(x, y) \approx \frac{ y_1 - y }{ y_1 - y_0 } f(x, y_0) + \frac{ y - y_0 }{ y_1 - y_0 } f(x, y_1) .    (2.8)
Trilinear interpolation follows the same principle, but with one additional dimension.
Linear extrapolation (Fig. 2.3b) is a common method for computing values for inputs falling outside of the range of precomputed values. Considering again the one-dimensional case, the point closest to x and the point next to that one are used to approximate f(x). Expressed in terms of these two points x_0 and x_1, where x_0 is the point closer to x, f(x) is approximated by

    f(x) \approx f(x_0) + \frac{ x - x_0 }{ x_1 - x_0 } ( f(x_1) - f(x_0) ) .    (2.9)
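The following NumPy sketch implements (2.6) and (2.9) for a one-dimensional precomputed table; the grid spacing and the stand-in function are arbitrary choices for illustration.

    import numpy as np

    def lerp_table(x, xs, ys):
        # xs: equally spaced sample points, ys: precomputed f(xs)
        i = np.clip(np.searchsorted(xs, x) - 1, 0, len(xs) - 2)
        x0, x1, y0, y1 = xs[i], xs[i + 1], ys[i], ys[i + 1]
        # inside the table this is (2.6); outside, it degenerates to
        # linear extrapolation from the two nearest samples, cf. (2.9)
        return y0 + (x - x0) / (x1 - x0) * (y1 - y0)

    xs = np.linspace(-8.0, 8.0, 161)       # resolution 0.1
    ys = np.tanh(xs)                        # stand-in for an expensive function
    print(lerp_table(0.42, xs, ys), np.tanh(0.42))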
2.3. Software Components
2.3.1. CUDA
CUDA (NVIDIA Corporation, 2015a) is a programming framework by NVIDIA that
enables the usage of GPUs for general purpose parallel computing. CUDA C is its
interface to the C programming language.
A GPU consists of a number of multiprocessors composed of multiple cores, which
execute groups of threads concurrently. Managing those threads is done using an
architecture called SIMT (Single-Instruction, Multiple-Thread), where all cores of one
multiprocessor execute one common instruction at the same time (see Fig. 2.4). This
means that programs not involving data-dependent branching, where all threads follow
the same execution path, are most efficient.
Figure 2.4.: A set of SIMT multiprocessors where multiple processors and threads share one instruction unit and have access to memory of various kinds [adapted from NVIDIA Corporation (2015b, Ch. 3)].
Threads have access to several types of memory which differ in properties like latency,
lifetime and scope. One of those memory types is a special kind of read-only memory
called texture memory. It can be used for accessing one-, two- or three-dimensional
data and provides different filtering and addressing modes. The filtering mode refers
to the manner in which a value is computed based on the input coordinates. One
possibility is linear filtering, which performs linear, bilinear or trilinear interpolation between neighboring elements using 9-bit fixed-point numbers for the coefficients weighting the elements. Handling of coordinates that fall outside of the valid range is specified by the addressing mode: it is possible to either return zero, return the value at the border of the coordinate range, or return the value obtained by wrapping or mirroring the texture at the border.
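To illustrate why this filtering mode matters for accuracy later on (see Section 4.4), the sketch below emulates in NumPy how rounding the linear-filtering weight to 8 fractional bits (the 9-bit fixed-point coefficient format mentioned above) perturbs an interpolated value; the sample values and the weight are arbitrary.

    import numpy as np

    def lerp(y0, y1, alpha):
        return (1.0 - alpha) * y0 + alpha * y1

    def lerp_texture_like(y0, y1, alpha):
        # the texture unit stores the interpolation weight in fixed point
        # with 8 fractional bits, i.e. alpha is rounded to a multiple of 1/256
        alpha_q = np.round(alpha * 256.0) / 256.0
        return lerp(y0, y1, alpha_q)

    y0, y1, alpha = 1.0, np.e, 0.4213
    print(lerp(y0, y1, alpha) - lerp_texture_like(y0, y1, alpha))  # quantization error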
2.3.2. Theano
Theano (Bergstra, Breuleux, Bastien, et al., 2010; Bastien, Lamblin, Pascanu, et al., 2012)
is a Python library that uses a symbolic representation of mathematical expressions to
build expression graphs which are then used to generate and compile C++ or CUDA C
code for computation on a central processing unit (CPU) or graphical processing unit
(GPU). The nodes in expression graphs represent either variables of specified datatype
and dimension or mathematical operators defining how variables are combined. Before
generating code, expression graphs are optimized in terms of computational efficiency
and numerical stability, e.g. by elimination of unnecessary computations such as division by one or by manipulation of the computational order.
Figure 2.5.: (a) shows the forward traversal of an expression graph, while (b) depicts the reverse mode for differentiation: the gradient of an operator computing f(x) has to be specified as an expression graph. Its inputs are x and \partial C / \partial f, which has been derived by previous back-propagation steps that started with \partial C / \partial C = 1, and its output has to evaluate to \partial C / \partial x. \partial f / \partial x can usually be expressed either through a graph containing the operator itself or through a designated gradient operator.
Expression graphs additionally make it possible to automatically derive the gradient with respect to any parameter involved in the graph by a method called reverse-mode automatic differentiation (Griewank, 2003; Baydin, Pearlmutter, and Radul, 2015). The gradient of a scalar-valued function C : \mathbb{R}^n \to \mathbb{R} is computed by traversing its expression graph from the output C(x_1, \dots, x_n) to the inputs x_1, \dots, x_n while iteratively applying the chain rule of differential calculus. The process builds a new expression graph using specifications of the gradient of each node's output with respect to its inputs. The computation scheme is displayed in Fig. 2.5, where a node with one input and one output is involved in the computation of a cost C.
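As a small illustration of this mechanism (a sketch using only standard Theano calls, with an arbitrary toy expression), the gradient graph for a scalar cost can be built and compiled as follows:

    import numpy as np
    import theano
    import theano.tensor as T

    x = T.dvector('x')                        # symbolic input
    w = theano.shared(np.ones(3), name='w')   # trainable parameter
    cost = T.sum(T.tanh(T.dot(w, x)) ** 2)    # some scalar cost C(w, x)

    grad_w = T.grad(cost, w)                  # reverse-mode AD builds the gradient graph
    f = theano.function([x], [cost, grad_w])  # compiled to C/CUDA code

    print(f(np.array([0.1, 0.2, 0.3])))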
For computations that cannot be expressed by combining already provided operators,
a custom operator can be introduced by directly providing an implementation in C++
or CUDA C and a symbolic expression for the gradient. It is possible to define a new
operator for a scalar only and let Theano automatically generate versions that apply
the scalar operator element-wise to vectors, matrices or tensors.
On the GPU, Theano computations always use single precision floating-point numbers.
Thus, even though the intermediate values in self-introduced computations can have
double precision, all inputs and outputs of operators have single precision.
2.4. Further Related Work
Even though multiplicative interactions were found to be useful for certain regression and classification problems long ago (Durbin and Rumelhart, 1989), training of such networks was challenging due to complex error landscapes (Schmitt, 2002;
Engelbrecht and Ismail, 1999). Some approaches therefore utilized other optimization
procedures, such as genetic algorithms to train ANNs with product units (Janson
and Frenzel, 1993). Additionally, a deliberate decision determining which of the two
operations a specific neuron should perform was necessary, as no efficient algorithm
to learn automatically whether a particular neuron should sum or multiply its inputs
was known. As a result, most research in the past 25 years focused on pure additive
networks. Recently, multiplicative interactions have resurfaced (Droniou and Sigaud,
2013; Sutskever, Martens, and G. E. Hinton, 2011), but to the best of our knowledge
efficiently learning from data whether a particular neuron should perform an additive
or multiplicative combination of its inputs has not been attempted. Additive ANNs
are highly inspired by biology. As Schmitt (2002) suggests, multiplicative interactions between neurons are biologically justified as well. This has since been investigated by
others, e.g. Schnupp and King (2001) and Gabbiani, Krapp, Hatsopoulos, et al. (2004).
There exist other approaches where the transfer function is learned together with the
usual parameters in ANNs. In both Agostinelli, Hoffman, Sadowski, and Baldi (2014)
and He, Zhang, Ren, and Sun (2015), learnable variants of ReLUs are suggested that incorporate learning the transfer function into standard gradient-based training. Earlier work followed a hybrid approach, where the transfer function is adapted through a genetic algorithm and the other parameters are found in a conventional way (Yao, 1999;
Turner and Miller, 2014).
We will investigate two approaches for fractional iteration of the exponential function,
one of which introduces the use of complex numbers into ANNs. Some considerations
in training of complex-valued ANNs have been studied in Hirose (2012).
3. Fractional Iterates of the Exponential
Function
Since an explicit solution for fractional iteration of the exponential function is needed, the following sections elaborate on two possible solutions. Note that the first solution lacks support for a range of input values, which motivates the second one,
which in turn exhibits singularities at a distinct number of inputs. Additionally, the
use of fractional iteration within ANNs is explained. For a thorough overview on
functional equations for function iteration, see Kuczma, Choczewski, and Ger (1990).
3.1. Abel’s Functional Equation
Abel’s functional equation is given by

    \psi( f(x) ) = \psi(x) + \beta    (3.1)

for f : \mathbb{C} \to \mathbb{C} and \beta \neq 0. Simplifying \psi( f^{(n)}(x) ) using (3.1) leads to

    f^{(n)}(x) = \psi^{-1}( \psi(x) + n\beta ) .    (3.2)
The equation above is not restricted to whole numbers n. We therefore assume n \in \mathbb{R} and use (3.1) with f(x) = \exp(x) to find the fractional exponential function f^{(n)}(z) = \exp^{(n)}(z). With \beta = 1 and f^{-1}(x) = \log(x), Abel’s functional equation can be reformulated as

    \psi(x) = \psi(\log(x)) + 1 .    (3.3)

Assuming \psi(x) = x for 0 \leq x < 1, we can find a solution for \psi which is valid on all of \mathbb{R}. Since we have \log(x) < x for any x \geq 1, multiple applications of (3.3) will, at some point, lead to an argument of \psi that falls into the interval where we assume \psi to be linear. In case x < 0, we can use \psi(x) = \psi(\exp(x)) - 1 to reach the desired interval, since 0 < \exp(x) < 1 for x < 0. In combination, this leads to a piece-wise defined solution for \psi,

    \psi(x) = \log^{(k)}(x) + k    (3.4a)
    with k \in \mathbb{N} \cup \{-1, 0\} : 0 \leq \log^{(k)}(x) < 1 .    (3.4b)
Figure 3.1.: A continuously differentiable solution \psi(x) of Abel’s equation (3.1) with f(x) = \exp(x).

Figure 3.2.: Real-valued fractional exponential function \exp^{(n)}(x) as specified in (3.6) for n \in \{-1, -0.9, \dots, 0, \dots, 0.9, 1\}.
The function is displayed in Fig. 3.1.
Because \psi is monotonically increasing on \mathbb{R} with range (-1, \infty), its inverse \psi^{-1} exists and is computed as

    \psi^{-1}(\psi) = \exp^{(k)}(\psi - k)    (3.5a)
    with k \in \mathbb{N} \cup \{-1, 0\} : 0 \leq \psi - k < 1 .    (3.5b)

The real-valued fractional exponential function, depicted in Fig. 3.2, can then be defined by

    \exp^{(n)}(x) = \psi^{-1}( \psi(x) + n ) .    (3.6)
Note that this solution is not necessarily unique. The derivation of other solutions
under different initial assumptions might be possible.
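A direct NumPy transcription of (3.4)-(3.6) might look as follows (a sketch for scalar inputs, without the range clipping and iteration limits that the actual implementation in Ch. 4 adds):

    import numpy as np

    def psi(x):
        # piece-wise solution (3.4): iterate log until the argument lies in [0, 1)
        if x < 0:
            return np.exp(x) - 1.0          # k = -1 branch
        k = 0
        while x >= 1.0:
            x = np.log(x); k += 1
        return x + k

    def psi_inv(p):
        # inverse (3.5): k iterations of exp so that 0 <= p - k < 1
        if p < 0:
            return np.log(p + 1.0)          # k = -1 branch
        k = int(np.floor(p)); p -= k
        for _ in range(k):
            p = np.exp(p)
        return p

    def exp_frac(x, n):
        # real-valued fractional exponential (3.6)
        return psi_inv(psi(x) + n)

    print(exp_frac(2.0, 1.0), np.exp(2.0))   # n = 1 reproduces exp
    print(exp_frac(2.0, 0.0))                # n = 0 is the identity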
3.1.1. Continuous Differentiability of the Real-Valued Fractional
Exponential
We first prove that ψ is differentiable and thus also continuous on R. Here, we only
have to consider points where the number of necessary iterations k changes. The
pieces with constant k are differentiable since they are composed of differentiable
functions. k increases at each exp(k) (1) with k ∈ N ∪ {0}. Recalling the definition of
real differentiability in (2.3), we show that the limit given there exists for x0 = exp(k) (1)
by computing the left- and right-hand limit. The left-hand limit is
    \lim_{x \to x_0^-} \frac{ \psi(x) - \psi(\exp^{(k)}(1)) }{ x - \exp^{(k)}(1) }
      = \lim_{x \to x_0^-} \frac{ \log^{(k)}(x) + k - [ \log^{(k+1)}(\exp^{(k)}(1)) + (k + 1) ] }{ x - \exp^{(k)}(1) }
      = \lim_{x \to x_0^-} \frac{ \log^{(k)}(x) - 1 }{ x - \exp^{(k)}(1) }
      \overset{\text{l'H}}{=} \lim_{x \to x_0^-} \prod_{j=0}^{k-1} \frac{ 1 }{ \log^{(j)}(x) }
      = \prod_{j=0}^{k-1} \frac{ 1 }{ \exp^{(k-j)}(1) } .

In addition to l’Hôpital’s rule (l'H), we make use of the facts that \log^{(j)}(\exp^{(k)}(x)) = \exp^{(k-j)}(x) and \log(1) = 0. The right-hand limit is

    \lim_{x \to x_0^+} \frac{ \psi(x) - \psi(\exp^{(k)}(1)) }{ x - \exp^{(k)}(1) }
      = \lim_{x \to x_0^+} \frac{ \log^{(k+1)}(x) + (k + 1) - [ 0 + (k + 1) ] }{ x - \exp^{(k)}(1) }
      = \lim_{x \to x_0^+} \frac{ \log^{(k+1)}(x) }{ x - \exp^{(k)}(1) }
      \overset{\text{l'H}}{=} \lim_{x \to x_0^+} \prod_{j=0}^{k} \frac{ 1 }{ \log^{(j)}(x) }
      = \prod_{j=0}^{k} \frac{ 1 }{ \exp^{(k-j)}(1) } .

The only difference between the two limits above is the upper limit of the product, which introduces an additional factor 1 / \exp^{(k-k)}(1) = 1 / \exp^{(0)}(1) = 1 for x \to x_0^+. The limits are therefore equal.

Using an analogous proof, it can be shown that \psi^{-1} is differentiable as well. Furthermore, the derivatives \psi' and (\psi^{-1})' are continuous. This can be shown with a similar analysis of the left- and right-hand limits at \exp^{(k)}(1).
3.1.2. Derivatives Involved in the Real-Valued Fractional Exponential
The derivatives of \psi defined by (3.4a) and \psi^{-1} specified in (3.5a) are given by

    \psi'(x) = \begin{cases} \prod_{j=0}^{k-1} \frac{1}{\log^{(j)}(x)} & k \geq 0 \\ \exp(x) & k = -1 \end{cases}    (3.7)

with k as specified in (3.4b) and

    (\psi^{-1})'(\psi) = \begin{cases} \prod_{j=1}^{k} \exp^{(j)}(\psi - k) & k \geq 0 \\ \frac{1}{\psi + 1} & k = -1 \end{cases}    (3.8)

with k fulfilling (3.5b).

Consequently, the derivatives of \exp^{(n)} defined by (3.6) are given by

    \exp'^{(n)}(x) = \frac{\partial \exp^{(n)}(x)}{\partial x} = (\psi^{-1})'( \psi(x) + n )\, \psi'(x)    (3.9a)
    \exp^{(n')}(x) = \frac{\partial \exp^{(n)}(x)}{\partial n} = (\psi^{-1})'( \psi(x) + n ) .    (3.9b)
3.2. Schröder’s Functional Equation
The solution found for \exp^{(n)} in the previous section evaluates to -\infty or is undefined in some cases, e.g. for x < 0 and n = -1, where \exp^{(n)}(x) = \log(x). To deal with negative x for n \leq -1, fractional iterates of the complex exponential function have to be used. The fractional complex exponential function can be obtained as a solution of Schröder’s functional equation, defined as

    \chi( f(z) ) = \gamma\, \chi(z)    (3.10)

with \gamma \in \mathbb{C}. Schröder’s functional equation can be solved similarly to how we found a solution to Abel’s equation before. Repeated application of (3.10) leads to

    f^{(n)}(z) = \chi^{-1}( \gamma^n \chi(z) )    (3.11)

and, as before, we start with a local solution for \chi and then generalize it so that \chi is defined on all of \mathbb{C}.
Two important properties of the complex exponential function are that it is not injective, since

    \exp(z + 2\pi i n) = \exp(z), \quad n \in \mathbb{Z}, \; i = \sqrt{-1} ,    (3.12)

and that it has a fixed point at, amongst others,

    \exp(c) = c \approx 0.3181315 + 1.3372357 i .    (3.13)

This means that the range of the inverse of exp has to be restricted to an interval such that the complex logarithm \log(z) is an injective mapping. We choose \log : \mathbb{C} \to \{ z \in \mathbb{C} : -1 \leq \operatorname{Im} z \leq -1 + 2\pi \}. This is commonly referred to as choosing a branch of the logarithm. For this branch of the logarithm, the fixed point c is attractive (see Urban and Smagt (2015) or Kneser (1950) for a detailed proof), which means that iterated application of the logarithm converges to c. An exception are the points z \in \{0, 1, e, e^e, \dots\}, due to the occurrence of \log^{(n)}(z) = 0 at some iteration n.

Examining the behavior of the complex exponential function within a small circle with radius r_0 around the fixed point c leads to the approximation

    \exp(z) = c z + c - c^2 + O(r_0^2) .    (3.14)

It follows that points on a circle around c with radius r, for sufficiently small r, are mapped to points on a larger circle with radius |c| r, since

    \exp(c + r e^{i\varphi}) = c (c + r e^{i\varphi}) + c - c^2 = c + c r e^{i\varphi} .    (3.15)

All points are rotated by \operatorname{Im} c \approx 76.618°.
Substituting approximation (3.14) into (3.10) as \chi(c z + c - c^2) = \gamma \chi(z) leads to a local solution for \chi with

    \chi(z) = z - c  \quad where |z - c| \leq r_0 and \gamma = c .    (3.16)

Since c is an attractive fixed point, we can use the reformulated version

    \chi(z) = c\, \chi(\log(z))    (3.17)

of (3.10) to find a solution to Schröder’s equation for the complex-valued exponential function,

    \chi(z) = c^k ( \log^{(k)}(z) - c )    (3.18a)
    with k = \arg\min_{k' \in \mathbb{N}_0} | \log^{(k')}(z) - c | \leq r_0 .    (3.18b)
Figure 3.3.: Domain coloring plot (see App. A.1) of \chi(z). Discontinuities arise at 0, 1, e, e^e, \dots and stretch into the negative complex half-plane. They are caused by log being discontinuous at the polar angles -1 and -1 + 2\pi.
Solving for z gives the inverse,

    \chi^{-1}(\chi) = \exp^{(k)}( c^{-k} \chi + c )    (3.19a)
    with k = \arg\min_{k' \in \mathbb{N}_0} | c^{-k'} \chi | \leq r_0 .    (3.19b)

Finally, the complex-valued fractional exponential is computed as

    \exp^{(n)}(z) = \chi^{-1}( c^n \chi(z) ) .    (3.20)

Figs. 3.3 and 3.4 display the function \chi and samples of \exp^{(n)}, respectively.
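A compact Python transcription of (3.18)-(3.20) could look as follows (a sketch using Python's built-in complex arithmetic; note that cmath.log uses the principal branch rather than the branch chosen above, and the handling of the singularities at 0, 1, e, ... is simplified compared to the implementation in Ch. 4):

    import cmath

    C = 0.3181315052047641 + 1.3372357014306895j   # fixed point exp(c) = c, cf. (3.13)
    R0 = 1e-4

    def chi(z, kmax=50):
        # iterate log until z lies within radius R0 of the fixed point, cf. (3.18)
        cpow, k = 1.0, 0
        while abs(z - C) >= R0 and k < kmax:
            z = cmath.log(z); cpow *= C; k += 1
        return cpow * (z - C)

    def chi_inv(x, kmax=50):
        # undo the scaling by c^k, then apply exp k times, cf. (3.19)
        k = 0
        while abs(x) >= R0 and k < kmax:
            x /= C; k += 1
        x += C
        for _ in range(k):
            x = cmath.exp(x)
        return x

    def exp_frac(z, n):
        # complex-valued fractional exponential (3.20)
        return chi_inv(C ** n * chi(z))

    print(exp_frac(2.0 + 0.5j, 1.0), cmath.exp(2.0 + 0.5j))   # n = 1 approximates exp
    print(exp_frac(-3.0 + 0.0j, -1.0))                        # a "logarithm" of a negative number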
3.2.1. Derivatives Involved in the Complex-Valued Fractional Exponential
Kneser (1950) shows that \chi is an analytic function (and thus complex differentiable) at all points in \mathbb{C} except \{0, 1, e, e^e, \dots\}.

The derivatives of \chi defined in (3.18a) and \chi^{-1} specified in (3.19a) are given as

    \chi'(z) = \prod_{j=0}^{k-1} \frac{ c }{ \log^{(j)}(z) }    (3.21)
Figure 3.4.: Iterates of the exponential function \exp^{(n)}(z) with \operatorname{Im} z = 0.5 for n \in \{0, 0.1, \dots, 0.9, 1\} (upper plots) and n \in \{0, -0.1, \dots, -0.9, -1\} (lower plots), obtained using the solution (3.20) of Schröder’s equation. Exp, log and the identity function are highlighted in orange.
with k fulfilling (3.18b), and

    (\chi^{-1})'(\chi) = \frac{1}{c^k} \prod_{j=1}^{k} \exp^{(j)}( c^{-k} \chi + c )    (3.22)

with k according to (3.19b). Fractional iteration of the complex exponential function defined in (3.20) is thus differentiated as

    \exp'^{(n)}(z) = c^n\, (\chi^{-1})'[ c^n \chi(z) ]\, \chi'(z)    (3.23a)
    \exp^{(n')}(z) = c^{n+1}\, (\chi^{-1})'[ c^n \chi(z) ]\, \chi(z) .    (3.23b)
Figure 3.5.: Possible structures for ANNs using the fractional exponential function: (a) one layer of an ANN where output y_i is computed according to (3.24); (b) one layer of an ANN where output y_i is computed according to (3.25). It is possible to use an additional non-linear transfer function \sigma after the second application of the fractional exponential in one layer. For the complex-valued fractional exponential, the weight matrix W contains complex numbers.

Figure 3.6.: Continuous interpolation between summation and multiplication for the values 2 and 7, i.e. \exp^{(n)}( \exp^{(-n)}(2) + \exp^{(-n)}(7) ), using real-valued (R-exp^{(n)}) and complex-valued (C-exp^{(n)}) fractional iteration of the exponential function.
3.3. Architecture of a Neural Net using Fractional Iterates of
the Exponential Function
One possibility to use the fractional exponential within an ANN is to introduce a parameter n_i for each neuron y_i in a layer and compute its value as

    y_i = \sigma\Big[ \exp^{(n_i)}\Big( \sum_j W_{ij} \exp^{(-n_i)}(x_j) \Big) \Big] .    (3.24)

However, this requires computing \exp^{(-n_i)}(x_j) separately for each pair of neurons (i, j), which increases the computational complexity significantly. Therefore, a different net structure is proposed by Urban and Smagt (2015), where

    y_i = \sigma\Big[ \exp^{(\hat{n}_i)}\Big( \sum_j W_{ij} \exp^{(-\tilde{n}_j)}(x_j) \Big) \Big] .    (3.25)

This decouples the parameters of \exp^{(n)}, and consequently arbitrary combinations of \hat{n}_i and \tilde{n}_j are possible. A visualization of the two possible structures above is given in Fig. 3.5.
The major advantage of interpolating between summation and multiplication for the operation of a neuron with fractional iterates of the exponential function is that the results can be computed for a whole layer using matrix arithmetic, whose implementation on both CPU and GPU is very efficient, just as for the conventional layer computation in Sec. 2.1. An example of interpolating between summation and multiplication for specific values is shown in Fig. 3.6.
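A layer of this form can be sketched in a few lines of NumPy by combining (3.25) with the real-valued exp^{(n)} from Sec. 3.1 (the helper functions are repeated here for self-containment; the layer sizes and the identity output non-linearity are arbitrary choices, and for n close to 1 the inputs are assumed to be positive):

    import numpy as np

    def _psi(x):
        if x < 0:
            return np.exp(x) - 1.0
        k = 0
        while x >= 1.0:
            x = np.log(x); k += 1
        return x + k

    def _psi_inv(p):
        if p < 0:
            return np.log(p + 1.0)
        k = int(np.floor(p)); p -= k
        for _ in range(k):
            p = np.exp(p)
        return p

    exp_frac = np.vectorize(lambda x, n: _psi_inv(_psi(x) + n))   # real-valued exp^(n), cf. (3.6)

    def frac_layer(x, W, n_hat, n_tilde, sigma=lambda a: a):
        # y_i = sigma( exp^(n_hat_i)( sum_j W_ij exp^(-n_tilde_j)(x_j) ) ), cf. (3.25)
        return sigma(exp_frac(W @ exp_frac(x, -n_tilde), n_hat))

    rng = np.random.RandomState(0)
    x = np.array([2.0, 7.0, 1.5])
    W = np.abs(rng.randn(2, 3))
    y_sum  = frac_layer(x, W, n_hat=np.zeros(2), n_tilde=np.zeros(3))  # pure summation
    y_prod = frac_layer(x, W, n_hat=np.ones(2),  n_tilde=np.ones(3))   # pure multiplication: prod_j x_j^W_ij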
4. Implementation Details
This chapter details how the real- and complex-valued fractional exponential functions are implemented in an efficient manner and discusses their integration into Theano. Considerations about numerical accuracy and improvements in computational efficiency are covered thereafter.
4.1. Implementation of the Real- and Complex-Valued
Fractional Exponential
The functions ψ and ψ−1 for the real-valued fractional exponential as well as χ and χ−1
for the complex-valued fractional exponential and their derivatives are introduced to
Theano as custom operators with both C++ and CUDA C code. The implementations
for CPU and GPU differ mostly in syntax, thus pseudocode is presented here. All
algorithms have a common basic structure. First, corner cases if any exist are handled.
Then, the number of necessary iterations k is determined either through direct computation or iteration until the respective condition on k is fulfilled. If possible, the iterated
logarithm or exponential is computed while attempting to fulfill the condition on k.
Algorithm 1 Computation of \psi(x).
  if x < 0 then return \exp(x) - 1
  else
      k \leftarrow 0
      while x > 1 and k < k_max (= 5) do
          x \leftarrow \log(x); k \leftarrow k + 1
      end while
      return x + k
  end if
Algorithms 1 and 2 compute \psi (3.4a) and \psi^{-1} (3.5a). The algorithms for the respective derivatives follow the same computational flow; however, for input values not covered by the corner cases, an additional accumulator variable initialized with one is added before the while-loop. It is used to accumulate the intermediate logarithms or exponentials inside the loop for computing the products occurring in (3.7) and (3.8).
Algorithm 2 Computation of \psi^{-1}(\psi). The value of \psi^{-1} for \psi < -1 would mathematically be -\infty. Similarly, inputs with \psi \gtrsim 4.406 evaluate to values larger than can be represented with single-precision floating-point numbers. Since infinite values cause problems in gradient-based learning, we restrict \psi^{-1} to the interval [-10, 10]. \psi_min and \psi_max are set to \psi_min = -0.999955 and \psi_max = 2.83403, so that \psi^{-1}(\psi_min) = -10 and \psi^{-1}(\psi_max) = 10.
  if \psi < \psi_min then return \psi^{-1}(\psi_min)
  else if \psi > \psi_max then return \psi^{-1}(\psi_max)
  else
      k \leftarrow \lceil \psi - 1.0 \rceil
      if k < 0 then
          return \log(\psi - k)
      end if
      \psi \leftarrow \psi - k
      while k > 0 do
          \psi \leftarrow \exp(\psi); k \leftarrow k - 1
      end while
      return \psi
  end if
Algorithm 3 Computation of \chi(z). In order to evaluate \chi even at the singularities 0, 1, e, \dots, a small \varepsilon is added to the input for those numbers. Choosing \varepsilon = 0.01 and \delta = 10^{-7} leads to a small absolute error for the known relations \exp^{(0)}(z) = z and \exp^{(1)}(z) = e^z.
  if z \in \{0, 1, e\} then    \triangleright when e.g. |z - 1| < \delta
      z \leftarrow z + \varepsilon
  end if
  k \leftarrow 0; cpow \leftarrow 1.0
  while |z - c| \geq r_0 and k < k_max (= 50) do
      z \leftarrow \log(z); cpow \leftarrow cpow \cdot c; k \leftarrow k + 1
  end while
  return cpow \cdot (z - c)
Algorithm 4 Computation of \chi^{-1}(\chi). Here, a second loop over k is required, as the argument of exp depends on k as well.
  k \leftarrow 0
  while |\chi| \geq r_0 and k < k_max (= 50) do
      \chi \leftarrow \chi / c; k \leftarrow k + 1
  end while
  \chi \leftarrow \chi + c
  for j \leftarrow 0; j < k; j \leftarrow j + 1 do
      \chi \leftarrow \exp(\chi)
  end for
  return \chi
Algorithms 3 and 4 compute \chi (3.18a) and \chi^{-1} (3.19a). r_0 is set to 10^{-4}. As before, the derivative computations follow the same algorithmic structure, but require an additional parameter for accumulating the intermediate logarithms and exponentials. Due to the lack of a complex datatype in Theano, complex-valued operators are realized with two inputs and two outputs representing the real and imaginary parts.
All algorithms above involve a certain number of iterations k. Fig. 4.1 shows how k evolves for different inputs. This both gives the order of the number of computational steps in the newly introduced operators and motivates the choice of k_max as 5 and 50. Plots 4.1a and 4.1b display the ranges of k, e.g. in Fig. 4.1a, a value of x = 2 requires k = 1 iteration. Plots 4.1c and 4.1d show the areas in which a certain number of iterations is required, e.g. in Fig. 4.1c, we can see that k = 35 holds for z = 6 - 4i. The large number of iterations in computations for the complex-valued fractional exponential function causes problems with both accuracy and speed.
4.2. Integration in Theano
In addition to an implementation, custom operators need to provide information on
how their gradient is computed. For real-valued operators ψ and ψ−1 this is achieved
by following the principle introduced in Section 2.3.2. Fig. 4.2 extends this principle to
operators computing a complex-valued function f(z) = f(x + iy) = u(x, y) + i v(x, y), e.g. \chi(z).
Figure 4.1.: Number of iterations k, and thus the order of the number of computational steps, for real ((a), (b)) and complex ((c), (d)) input values for (a) \psi(x), (b) \psi^{-1}(\psi), (c) \chi(z), (d) \chi^{-1}(\chi).
In our case, f'(z), e.g. \chi'(z), is computed by another custom operator. As both \chi and \chi^{-1} fulfill the Cauchy-Riemann equations (2.5), we know that \operatorname{Re} f' = \partial u / \partial x and \operatorname{Im} f' = \partial v / \partial x. Therefore, the expressions to be computed within reverse-mode differentiation of a cost C simplify to

    \frac{\partial C}{\partial x} = \frac{\partial C}{\partial u} \operatorname{Re} f' + \frac{\partial C}{\partial v} \operatorname{Im} f'    (4.1a)
    \frac{\partial C}{\partial y} = -\frac{\partial C}{\partial u} \operatorname{Im} f' + \frac{\partial C}{\partial v} \operatorname{Re} f' .    (4.1b)
We have now fully defined custom operators for scalar values. Using a wrapper that
generates a version that operates element-wise on multidimensional data, the operators
can be combined for computing the real- and complex-valued fractional exponential
within ANNs.
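The gradient bookkeeping of (4.1) can be written down compactly; the following NumPy sketch (with hypothetical placeholder arguments dC_du, dC_dv for the incoming gradients and fprime for the operator derivative) shows the conversion that each complex-valued operator has to provide:

    import numpy as np

    def complex_op_grad(dC_du, dC_dv, fprime):
        # dC_du, dC_dv: gradients of the cost w.r.t. the operator outputs u, v
        # fprime: complex derivative f'(z) of the operator, element-wise
        re, im = np.real(fprime), np.imag(fprime)
        dC_dx = dC_du * re + dC_dv * im       # (4.1a)
        dC_dy = -dC_du * im + dC_dv * re      # (4.1b)
        return dC_dx, dC_dy

    # toy check with f(z) = z^2, so f'(z) = 2z, and cost C = u (i.e. dC_du = 1, dC_dv = 0)
    z = np.array([1.0 + 2.0j])
    dC_dx, dC_dy = complex_op_grad(np.ones(1), np.zeros(1), 2.0 * z)
    # C = Re(z^2) = x^2 - y^2, so dC/dx = 2x = 2 and dC/dy = -2y = -4
    print(dC_dx, dC_dy)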
Figure 4.2.: The node computing the complex function f(x + iy) with inputs x, y and outputs u, v has to specify the gradients of the scalar cost C with respect to each input. The inputs x, y as well as the gradients \partial C / \partial u and \partial C / \partial v are given. \partial C / \partial x and \partial C / \partial y are computed as shown in (4.1) and serve as inputs for the gradient computation of the nodes for which x, y are the outputs.
4.3. Issues with Precision and Accuracy
Floating-point arithmetic using a fixed number of bits is subject to rounding errors of precision-dependent magnitude. Accumulation of those errors can lead to inaccurate results (Higham, 2002). This is particularly true for computations involving many steps, as each step potentially produces less accurate results due to previously made rounding errors.
In particular, iterates of exp, as they occur in calculating \psi^{-1} and \chi^{-1}, can amplify small input errors significantly even with a fairly small number of iterations. With a maximum of 4 steps on the considered interval with inputs between -8 and 8, numerical accuracy is not an issue for the real-valued fractional exponential. However, the complex-valued version typically involves a much larger number of steps. The subsequent sections provide an estimate of the relative error of iterating exp and show the influence of choosing either single or double precision on all expressions in numerical experiments.
4.3.1. Error Estimation
Assuming independence between input variables, the common formula for calculating the propagation of error by estimating the standard deviation \sigma_f of a function f(x, y, \dots) based on the standard deviations \sigma_x, \sigma_y, \dots is given (Philip R. Bevington, 2003) by

    \sigma_f^2 = \Big( \frac{\partial f}{\partial x} \Big)^2 \sigma_x^2 + \Big( \frac{\partial f}{\partial y} \Big)^2 \sigma_y^2 + \dots    (4.2)
Figure 4.3.: For a continuous function f and an input z that is subject to some small absolute error \delta, the function value for the erroneous input will be located within an area bounded by the mapping of the \delta-circle around z through f.
Thus, for \exp(x + iy) = \exp(x) [\cos(y) + i \sin(y)] with \hat{x} = \operatorname{Re} \exp(x + iy), and further assuming that \sin(y)^2 and \cos(y)^2 can be at most 1, we derive

    \sigma^2_{\exp(x)} = \exp(x)^2 \sigma_x^2
    \sigma^2_{\cos(y)} = \sin(y)^2 \sigma_y^2 \approx \sigma_y^2
    \Rightarrow \; \sigma_{\hat{x}}^2 = \exp(x)^2 \sigma^2_{\cos(y)} + \cos(y)^2 \sigma^2_{\exp(x)} \approx \exp(2x) \sigma_y^2 + \exp(4x) \sigma_x^4 .    (4.3)

This means that, in the worst case, the error grows exponentially in the input for just the real part of one iteration of exp. Consequently, the estimate yielded by the generic method for error propagation shown above is too coarse for our purpose.
A better approximation of how an error in the input propagates through multiple iterations of exp can be derived as follows. When computing \chi^{-1}, condition (3.19b) ensures that \exp(z) is only evaluated for points z \in C_0 = \{ c + r e^{i\varphi} \mid \varphi \in [0, 2\pi], r \leq r_0 \}. An error in the original input \chi to \chi^{-1}, or a rounding error introduced by the division \chi / c^k, can also only lead to erroneous points within C_0. Furthermore, the exponential function is continuous. Therefore, values in close proximity in the input will also be close to each other in the output (see Fig. 4.3).

An estimate of the average relative error E(n) for computing n iterates of the exponential function of points z \in C_0 can therefore be derived by considering the expansion of C_0 under the same transformation. We approximate E_area(n) as the square root of the ratio between the area A_n covered by computing iterates of points within C_0 and the area of C_0,

    E_{\text{area}}(n) = \sqrt{ \frac{ \operatorname{area}( \exp^{(n)}(z) \mid z \in C_0 ) }{ \operatorname{area}(C_0) } } = \sqrt{ \frac{ A_n }{ \pi r_0^2 } } .    (4.4)
As mentioned before (see Section 3.2), points on a circle with small radius r around c are mapped to points on a larger circle with radius |c| r. With a rising number of iterations of exp^{(n)} this condition does not hold anymore and the covered area diverges from a circular shape. Then, the area can be derived by sampling points on C_0 and computing the area of the bounding polygon after iterating the points with exp. For n \geq 33 the shape becomes self-intersecting, since exp is not an injective function on \mathbb{C}. The development of the covered area is shown in Fig. 4.4. Table 4.1 shows the computed error estimations. Since numeric computation of iterated exp for exact values yields relative errors of the same order of magnitude as our estimations, it can be concluded that the approximation by considering the development of the area is valid.

    n  | E_circle(n) | E_area(n) | E_single(n) | E_double(n)
    26 |     3 910   |   3 978   |    3 949    |    7 319
    27 |     5 375   |   5 552   |    5 463    |   10 125
    28 |     7 389   |   7 860   |    7 604    |   14 102
    29 |    10 156   |  11 441   |   10 764    |   19 963
    30 |    13 960   |  17 621   |   15 733    |   29 136
    31 |    19 189   |  30 623   |   24 504    |   45 340
    32 |    26 376   |  70 222   |   43 765    |   80 839
    33 |    36 255   | 328 253   |  131 981    |  245 512

Table 4.1.: Error estimation and numerically computed errors for n iterations. E_circle(n) denotes the error computed under the assumption that the covered shape is and stays a circle. E_area(n) refers to the error derived by (4.4). A notable difference begins to occur at iteration 27. The two estimations are compared to the numerically derived relative errors of computing n iterations of exp for exact test points with single and double precision.
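The area-based estimate can be reproduced in a few lines; the sketch below samples the boundary of C_0, iterates exp, and applies the shoelace formula for the polygon area (a simplified reconstruction of the procedure described above, not the thesis code itself, normalized by the area of C_0 as in (4.4)):

    import numpy as np

    C = 0.3181315052047641 + 1.3372357014306895j   # fixed point of exp
    R0 = 1e-4

    def area_error_estimate(n, samples=4096):
        # boundary of C_0, iterated n times with exp
        z = C + R0 * np.exp(2j * np.pi * np.arange(samples) / samples)
        for _ in range(n):
            z = np.exp(z)
        x, y = z.real, z.imag
        # shoelace formula for the area of the bounding polygon
        area = 0.5 * np.abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))
        return np.sqrt(area / (np.pi * R0 ** 2))    # cf. (4.4)

    print(area_error_estimate(26))   # on the order of the n = 26 row of Table 4.1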
4.3.2. Numerical Verification
The findings above are verified in a numerical experiment, where \chi(z), \chi'(z), \chi^{-1}(\chi), (\chi^{-1})'(\chi), \exp^{(n)}(z), \exp'^{(n)}(z) and \exp^{(n')}(z) are evaluated for random numbers z with real and imaginary parts in [-8, 8] and n \in [-1, 1]. The ground-truth values for the experiment are generated with Mathematica, which, while generally also using double precision computations, is optimized to use higher precision where this is deemed necessary for accurate results. \chi^{-1} and (\chi^{-1})' use these 'true' values of \chi as input. Fig. 4.5 displays the relative errors computed as |z - \hat{z}| / |z|, where \hat{z} refers to the numerically computed and z to the true value.
4. Implementation Details
Im
3
2
1
29
−3
−2
28
−1
33
30
c
31
32
1
2
3
Re
−1
−2
Figure 4.4.: Development of the area covered by points in C0 after application of exp(n)
for different values of n.
single precision
double precision
10−1
10−3
10−5
10−7
10−9
10−11
χ
χ −1
χ0
χ −1
0
0
exp(n) exp0(n) exp(n )
Figure 4.5.: Box-and-whiskers plot (see App. A.2) for the relative error |z − ẑ|/|z| for
the result of computing the respective expressions for one million complex
test points with single and double precision.
28
4. Implementation Details
As suspected, expressions involving the exponential function have a higher relative error, even for exact inputs. Also, the difference between single and double precision is significant for all expressions.
The precision-related accuracy issues discussed above have a significant effect when used within an ANN, where exp^{(n)} is evaluated at least once per layer and error-prone gradient calculations can hinder learning. While the use of double precision provides higher accuracy, execution times on GPUs are significantly longer due to less optimized double-precision arithmetic. On the CPU, apart from the obvious use of more memory and the generally slower computation due to less parallelization, a similar effect cannot be observed when switching from single to double precision.
4.4. Interpolation
In addition to the issues with accuracy, calculating the complex-valued exp^{(n)} and its derivatives is rather time-consuming as a result of the high number of computational steps compared to regular transfer functions. Training ANNs with the suggested transfer function within reasonable time thus requires a more efficient computation procedure. We consider two interpolation methods. Method A uses 2d-interpolation (the real and imaginary parts of the argument are treated as two separate dimensions) to interpolate \chi, \chi^{-1} and their derivatives for real and imaginary inputs, and uses exact computation for c^n. Fractional iteration of the exponential function is then computed according to (3.20) with the interpolated \chi and \chi^{-1}. Method B additionally utilizes the parameter n to perform 3d-interpolation of exp^{(n)} directly.
Precomputed values of \chi(z), \chi'(z), \exp^{(n)}(z), \exp'^{(n)}(z) and \exp^{(n')}(z), with inputs z sampled from the square with corners -8 - 8i and 8 + 8i and n \in [-1, 1] at different resolutions, are derived with Mathematica and then saved in binary files. For the inverse functions \chi^{-1}(\chi) and (\chi^{-1})'(\chi), the range of \chi is chosen so that it covers the range of c^n \chi(z), i.e. the rectangle with corners -6.9 - 4i and 5.5 + 7.5i. For computations on the CPU, these binary files are accessed through memory-mapping and an implementation of bi- or trilinear interpolation following the equations in Section 2.2.2. For computations on the GPU, the precomputed values are copied over to texture memory and can then be accessed through linear filtering (see Section 2.3.1).
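A simplified picture of the Method B lookup (3d-interpolation over Re z, Im z and n) is given below; the table here is filled with a cheap placeholder function instead of the precomputed exp^{(n)} values, and the grid parameters are arbitrary choices for the example:

    import numpy as np

    def trilinear(table, lo, step, query):
        # table: values on a regular 3d grid; lo/step: grid origin and spacing;
        # query: (x, y, n) point; interpolation follows Section 2.2.2
        t = (np.asarray(query) - lo) / step
        i0 = np.clip(np.floor(t).astype(int), 0, np.array(table.shape) - 2)
        f = t - i0                                   # fractional position in the cell
        out = 0.0
        for dx in (0, 1):
            for dy in (0, 1):
                for dn in (0, 1):
                    w = ((1 - f[0]) if dx == 0 else f[0]) * \
                        ((1 - f[1]) if dy == 0 else f[1]) * \
                        ((1 - f[2]) if dn == 0 else f[2])
                    out += w * table[i0[0] + dx, i0[1] + dy, i0[2] + dn]
        return out

    # placeholder table: f(x, y, n) = (x + y) * n on a coarse grid
    lo, step = np.array([-8.0, -8.0, -1.0]), np.array([0.1, 0.1, 0.05])
    xs, ys, ns = (np.arange(lo[i], 8.001 if i < 2 else 1.001, step[i]) for i in range(3))
    X, Y, N = np.meshgrid(xs, ys, ns, indexing='ij')
    table = (X + Y) * N
    print(trilinear(table, lo, step, (0.33, -2.7, 0.4)), (0.33 - 2.7) * 0.4)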
For input values falling outside of the ranges specified above, one approach yielding continuous transitions at the borders would be linear extrapolation. However, determining on which side or face of the shape defined by the sampling ranges an input point lies requires a case differentiation with 3^d - 1 cases, where d is the dimension (see Algorithm 5). Such data-dependent branching is not ideal for a SIMT architecture, where multiple threads share one instruction unit and therefore all threads have to execute a branch if branching occurs for at least one of the threads. Thus, we use the mirroring addressing mode of texture memory instead. This yields continuous transitions at the borders and provides gradient information at all input points.

Figure 4.6.: Relative error for the result of computing the respective expressions for one million complex test points with the highest sampling resolution for method A (2d) and method B (3d).
Algorithm 5 Extrapolation in one dimension according to (2.9), where the function f(x) was precomputed for x \in [x_min, x_max] with resolution r_x.
  if x_min \leq x \leq x_max then
      return regular interpolation
  else
      if x < x_min then
          x_0 \leftarrow x_min, x_1 \leftarrow x_min + r_x
      else
          x_0 \leftarrow x_max, x_1 \leftarrow x_max - r_x
      end if
      return f(x_0) + \frac{x - x_0}{x_1 - x_0} ( f(x_1) - f(x_0) )
  end if
In order to evaluate different sampling resolutions with respect to runtime, accuracy
and memory usage, we use the test setup discussed in Section 4.3.2. The results are
shown in Tables 4.2 and 4.3. Higher resolutions than the ones specified are not feasible
due to memory limitations.
    r_z    | CPU e_rel [10^-5] | CPU t [ms] | GPU e_rel [10^-5] | GPU t [ms] | M [MB]
    0.1    |       27.45       |   269.13   |       31.05       |    9.29    |   0.62
    0.05   |        6.757      |   274.57   |        9.698      |   10.38    |   2.45
    0.01   |        0.278      |   388.48   |        1.313      |   15.80    |  60.91
    0.0075 |        0.160      |   421.63   |        0.987      |   15.72    | 108.20
    0.005  |        0.080      |   478.90   |        0.685      |   16.56    | 243.46
    0.0035 |        0.050      |   490.75   |        0.541      |   16.85    | 496.60

Table 4.2.: Results for the interpolation of exp^{(n)} with method A. r_z is the sampling resolution used for choosing the values to be precomputed. t denotes the runtime of computing function values for one million test points and M gives the required memory for the respective setting. All GPU computations have been conducted on an Nvidia Quadro K2200. The CPU used is an Intel Xeon E3-1226 v3 @ 3.30 GHz.
    r_z   | r_n   | CPU e_rel [10^-3] | CPU t [ms] | GPU e_rel [10^-3] | GPU t [ms] | M [MB]
    0.1   | 0.1   |       9.865       |    63.83   |       9.877       |    6.58    |  12.46
    0.1   | 0.05  |       2.487       |   123.31   |       2.497       |    7.59    |  24.32
    0.05  | 0.1   |       9.790       |   168.43   |       9.799       |    7.65    |  49.52
    0.05  | 0.05  |       2.456       |   180.06   |       2.465       |    8.54    |  96.70
    0.1   | 0.01  |       0.116       |   187.69   |       0.138       |    8.00    | 119.25
    0.05  | 0.025 |       0.619       |   190.78   |       0.628       |    8.55    | 191.03
    0.075 | 0.01  |       0.101       |   200.08   |       0.129       |    8.72    | 210.69
    0.075 | 0.005 |       0.003       |   212.79   |       0.005       |    8.12    | 420.32

Table 4.3.: Results for the interpolation of exp^{(n)} with method B. r_z and r_n are the sampling resolutions used for choosing the values to be precomputed. All other columns correspond to the respective columns in Table 4.2.
Figure 4.7.: Relative error for exp^{(0.75)} using methods A and B with the respective highest resolutions. (a) Relative error for exp^{(0.75)} using 2d-interpolation and r_z = 0.0035; interpolation is expectedly difficult at the singularities 0, 1, e, \dots and around the branch cut, and other areas would benefit from a finer sampling resolution. (b) Relative error for exp^{(0.75)} using 3d-interpolation with r_z = 0.075 and r_n = 0.005; compared to the plot for 2d-interpolation, the visible grid is not distorted due to the direct interpolation.
Method A is more accurate for almost all settings. A similar accuracy can only be reached with the highest sampling resolution in n for method B. The higher impact of r_n on the error can also be seen in the fact that settings with the same r_n but a higher sampling resolution r_z give only marginally better results. A direct comparison between the two settings with the respective highest resolutions (see Fig. 4.6) shows that even though their median relative errors lie close to each other, the interpolation with method B still covers a wider range. However, as less computational effort is needed in method B, it is generally faster than method A, which involves two interpolation operations and the exact computation of c^n. The differences in accuracy between CPU and GPU at the same sampling resolution arise from the internal use of 9-bit fixed-point numbers for the coefficients in texture fetches on the GPU, which have lower precision than the 32-bit floating-point numbers used
This is most likely the result of more cache misses deriving from accessing files or
textures in a very random manner. Even though an accuracy better than the one seen
for single-precision numbers in Section 4.3.2 can be achieved by interpolation, Fig. 4.7
shows that there are some problematic areas for interpolation. As the primary goal
of using interpolation is a more efficient computation procedure, Method B will be
used in further testing. While this definitely reduces processing time, it remains to be
verified if interpolation has a negative impact on learning behavior within ANNs.
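For reference, the core operation of method B, a trilinear lookup into a table of precomputed exp(n) values over (Re z, Im z, n), can be sketched in a few lines of NumPy. This is only an illustration of the technique: the array layout, the clamping at the table border, and all names are assumptions rather than the implementation used here; on the GPU the lookup is instead performed by the texture hardware, which uses a mirroring addressing mode.

    import numpy as np

    def trilinear_lookup(table, origin, spacing, points):
        # table   : complex array of shape (Nx, Ny, Nn) with precomputed exp(n)(z) values
        # origin  : coordinates (Re z, Im z, n) of table[0, 0, 0]
        # spacing : grid resolutions, e.g. (r_z, r_z, r_n)
        # points  : (M, 3) query coordinates (Re z, Im z, n)
        origin, spacing = np.asarray(origin), np.asarray(spacing)
        u = (points - origin) / spacing                        # fractional grid coordinates
        i0 = np.clip(np.floor(u).astype(int), 0, np.array(table.shape) - 2)
        f = np.clip(u - i0, 0.0, 1.0)                          # per-axis interpolation weights

        out = np.zeros(len(points), dtype=table.dtype)
        for dx in (0, 1):                                      # accumulate the 8 cell corners
            for dy in (0, 1):
                for dz in (0, 1):
                    w = (np.where(dx, f[:, 0], 1 - f[:, 0])
                         * np.where(dy, f[:, 1], 1 - f[:, 1])
                         * np.where(dz, f[:, 2], 1 - f[:, 2]))
                    out += w * table[i0[:, 0] + dx, i0[:, 1] + dy, i0[:, 2] + dz]
        return out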
5. Experimental Results
In the following, we investigate how ANNs using fractional iterates of the exponential
function as transfer functions and therefore interpolation between summation and
multiplication perform on different datasets. Alongside standard gradient descent
we also apply adaptive optimizers such as Adam (Kingma and Ba, 2015), RMSProp
(Tieleman and Hinton, 2012), or Adadelta (Zeiler, 2012), which are provided by Bayer (2013) for easy use with Theano.
5.1. Recognition of Handwritten Digits
We perform multi-class logistic regression with one hidden layer on the MNIST dataset
(LeCun, Bottou, Bengio, and Haffner, 1998). It contains 60 000 images of handwritten
digits zero to nine in its training set and 10 000 images in its test set. 10 000 samples of
the training set are used as a validation set. Each image has 784 pixels, which form the
input to the hidden layer.
We compare a conventional additive ANN to ANNs using the real- and complex-valued fractional exponential function as follows. In order for all networks to have approximately the same number of trainable parameters, the additive network and the network using the real fractional exponential function use 200 hidden units, while the ANN with the complex fractional exponential function has 100 hidden units. The setup resembles the one used for presenting benchmark results for Theano in Bergstra, Breuleux, Bastien, et al. (2010). Additionally, all nets include a sigmoid transfer function σ(x) = 1/(1 + e^-x) within the hidden layer. Outputs are thus computed according to (1.1) and (3.25). To investigate the impact of interpolating exp(n), both a network computing exact values and a network interpolating exp(n) with the presented interpolation method B¹ are tested.
All networks are trained with stochastic gradient descent (SGD) using an initial learning rate of 10^-1, a momentum of 0.5, and a minibatch size of 60 samples. The learning rate is reduced whenever the validation loss does not improve for 100 consecutive iterations over the whole training set.
¹ Interpolation in three dimensions for the real and imaginary part of z and parameter n. Here, we use the highest resolution tested, with rz = 0.075 and rn = 0.005 (see Section 4.4).
Figure 5.1.: Histograms (each with 20 bins) for parameters n̂ and ñ within the hidden layer of an ANN using the real- and complex-valued fractional exponential function and multi-class logistic regression on the MNIST dataset. (a) Histograms for the ANN using the real fractional exponential function (R-exp(n)). (b) Histograms for the ANN using the complex fractional exponential function (C-exp(n)). (c) Histograms for the ANN using the complex fractional exponential function with interpolation method B (C-exp(n)_B).
ANN type   | n_pars  | n_d/s | i   | error [%]
additive   | 159 010 | 32387 | 283 | 2.03
R-exp(n)   | 159 994 | 21253 | 245 | 2.53
C-exp(n)_B | 159 894 | 17147 | 263 | 2.67
C-exp(n)   | 159 894 |   707 | 402 | 2.81

Table 5.1.: Results for using ANNs of different type for classification of handwritten digits. R-exp(n) and C-exp(n) stand for nets with the real and complex fractional exponential function; subscript B indicates that interpolation method B has been used. i is the iteration at which the best validation loss has been reached, and the error denotes how many digits within the test set have been wrongly classified at that iteration. Furthermore, the number of parameters n_pars and the number of data samples processed per second n_d/s are given.
Initially, all weights are set to zero with the exception of the real part of the weights within the hidden layer. These are initialized by randomly sampling from the uniform distribution over the interval

[ -4 √(6/(n_in + n_out)),  4 √(6/(n_in + n_out)) ]    (5.1)

following a suggestion from Glorot and Bengio (2010) for weight initialization in ANNs with a sigmoid transfer function. Here, n_in and n_out denote the number of input and output neurons for the hidden layer.
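A minimal sketch of this initialization, assuming NumPy and using the MNIST layer sizes from above (784 inputs, 200 hidden units for the real-valued net); the function name is purely illustrative:

    import numpy as np

    def init_hidden_weights(n_in, n_out, rng=np.random):
        # Uniform initialization over the interval (5.1), following Glorot and Bengio (2010).
        bound = 4.0 * np.sqrt(6.0 / (n_in + n_out))
        return rng.uniform(-bound, bound, size=(n_in, n_out))

    # Real part of the hidden-layer weights for the MNIST setup; all other weights start at zero.
    W_real = init_hidden_weights(784, 200)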
Table 5.1 shows the results for training the four networks until convergence. Both structures using interpolation between summation and multiplication achieve similar, though not surpassing, performance on the MNIST dataset. Furthermore, there is a difference in performance between the two complex-valued nets. The net using interpolation is numerically more stable when large gradients cause n̂ and ñ to become large: values outside [−1, 1] are mapped back into this interval by the mirroring addressing mode used in interpolation. Fig. 5.1 displays histograms of the parameters n̂ and ñ for each of the tested ANN architectures at the iteration with the best validation loss. While many of the parameters remain close to their initialization value of zero, some have changed notably, and thus the ANNs with our proposed transfer function perform operations beyond purely additive interactions. It is also clearly visible from the histograms alone that the versions with and without interpolation reach significantly different solutions.
optimizer | mom  | test loss [10^-6] | i   | σ
Adam      | 0.0  | 1.76              | 828 | linear
Adam      | 0.5  | 2.53              | 610 | linear
Adam      | 0.99 | 2.54              | 728 | linear
SGD       | 0.99 | 2.95              | 981 | tanh
RMSProp   | 0.99 | 3.19              | 735 | linear
Table 5.2.: The best performing models for an ANN learning a multivariate polynomial with the real fractional exponential function. i denotes the iteration at which the given test loss has been reached. Note that σ is the additional non-linear transfer function in the first layer; the second layer uses no additional non-linear transfer function.
5.2. Multinomial Regression
To analyze the behavior of parameter n in ANNs that use interpolation between summation and multiplication, we examine a synthetic dataset that exhibits multiplicative interactions between its inputs. The function to be learned is a randomly generated multivariate polynomial, or multinomial,

f(x, y) = a00 + a10 x + a01 y + a11 xy + a20 x² + a02 y²    (5.2)

of second degree with two inputs x and y. This function can be computed exactly by a two-layer ANN with six hidden units, where the first layer performs multiplication with the weight matrix containing the respective powers for each summand and the second layer uses addition with the coefficients a_ij as weights.
We generate a dataset with a total of 10 000 samples by first randomly sampling the six coefficients a_ij uniformly from the interval [0, 1]. Then, input values x, y are sampled from the same interval and the respective output is computed according to (5.2). From those 10 000 samples, 8 000 randomly chosen samples are used as the training set. The remaining samples are split into a validation set and a test set of equal size.
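The data generation described above can be sketched as follows; the fixed seed and all variable names are illustrative choices, not taken from the thesis code:

    import numpy as np

    rng = np.random.RandomState(0)          # seed chosen arbitrarily for this sketch

    # Random second-degree multinomial, cf. (5.2).
    a00, a10, a01, a11, a20, a02 = rng.uniform(0.0, 1.0, size=6)
    f = lambda x, y: a00 + a10*x + a01*y + a11*x*y + a20*x**2 + a02*y**2

    x, y = rng.uniform(0.0, 1.0, size=(2, 10_000))
    targets = f(x, y)

    # 8 000 samples for training, the remainder split evenly into validation and test.
    idx = rng.permutation(10_000)
    train, valid, test = idx[:8_000], idx[8_000:9_000], idx[9_000:]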
Using a network with the same structure as needed for an exact solution, i.e. two layers and six hidden units, we perform a hyper-parameter search to find a combination of parameters for which the function f(x, y) is approximated well, both for a net using the real and for one using the complex fractional exponential function. Tested combinations include different optimizers (SGD, Adam, RMSProp and Adadelta), momentum values, and the choice of an additional non-linear transfer function σ (linear, sigmoid, tanh) in the first layer. Since the learning rate is again reduced upon no further improvement, different learning rates are not part of the search.
optimizer | mom  | test loss [10^-6] | i   | σ
Adam      | 0.99 | 0.99              | 732 | tanh
Adam      | 0.0  | 1.08              | 601 | linear
RMSProp   | 0.99 | 1.10              | 200 | linear
Adam      | 0.5  | 1.13              | 396 | linear
RMSProp   | 0.99 | 1.16              | 325 | linear

Table 5.3.: The best performing models for an ANN learning a multivariate polynomial with the complex fractional exponential function. All other columns correspond to the respective columns in Table 5.2.
Figure 5.2.: Test and training loss for training a two-layer ANN with selected optimizers (R-exp(n) with GD and Adam; C-exp(n) with GD, Adam and Adadelta) learning a multivariate second-degree polynomial of two variables; the mse loss is shown on a logarithmic scale over training iterations.
38
0
−0.1
−0.2
−0.3
n̂
ñ
5. Experimental Results
0
200
600
400
iteration
800
0.4
0
−0.4
−0.8
−1.2
0
200
600
400
iteration
800
200
600
400
iteration
800
1.2
0.9
0.6
0.3
−0.1
n̂
ñ
(a) Parameters in the first layer.
0
200
600
400
iteration
800
0.6
0.2
−0.2
−0.6
0
(b) Parameters in the second layer.
Figure 5.3.: Progression of parameter n in an ANN learning a multivariate polynomial.
The parameters change most in the very beginning and than stabilize
quickly.
The initial learning rate is set to 10^-2, and all hyper-parameter combinations use a batch size of 100 samples.
Tables 5.2 and 5.3 show the settings for the five best performing models for a net using the real and the complex fractional exponential function, respectively. Although gradient descent performs reasonably well for such a small ANN, adaptive optimization methods generally yield better results. Fig. 5.2 additionally displays the progression of test and training loss for selected optimizers. The progression of parameter n in the case of the best model trained with gradient descent is shown in Fig. 5.3.
5.2.1. Generalization in Multinomial Regression
Closer investigation of the multivariate polynomial used in the previous section reveals that it can be learned even by an additive net with the same number of hidden units. The employed coefficients yield an almost bilinear function. Additionally, with 8 000 training samples in the two-dimensional unit box, it is likely that for each test point there exists a training sample close to it. Therefore, in order to analyze long-range generalization performance, we use a more complex polynomial of fourth degree with two inputs and split test and training set in a non-random way. Input values are sampled from the interval [0, 1]; however, all points within a circle around (0.5, 0.5) are used as the test set, and all points outside that circle make up the training set.
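A sketch of such a split is given below; the circle radius and the number of sampled points are illustrative placeholders, since the text does not specify them:

    import numpy as np

    rng = np.random.RandomState(0)
    xy = rng.uniform(0.0, 1.0, size=(10_000, 2))     # sample count not given in the text

    # Points inside a circle around (0.5, 0.5) become the test set,
    # all points outside the circle become the training set.
    radius = 0.25                                    # illustrative value, not from the thesis
    inside = np.linalg.norm(xy - 0.5, axis=1) < radius
    test_xy, train_xy = xy[inside], xy[~inside]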
We train three different three-layer net structures with the Adam optimizer, where the final layer is additive with no additional transfer function in all cases. The first neural network is purely additive with tanh as the transfer function for the first two layers. The second network uses the real fractional exponential function in the first two layers. The third and final network uses a multiplicative first layer, i.e. (1.2) is used as the transfer function. The progression of training and test loss for all three structures is displayed in Fig. 5.4.

Figure 5.4.: Test and training loss for the approximation of a multivariate polynomial for different net structures (tanh, R-exp(n), exp-log).

Here, our proposed transfer function generalizes best. This can also be seen when reviewing the relative error of the approximation (see Fig. 5.5), computed by comparing function value and approximation for points on a regular grid. Parts of the circle where the network has no training points are approximated well. It remains to be examined why the network containing the multiplicative layer, and thus resembling the model that actually generated the data, performs worst.
Figure 5.5.: Relative error for the approximation of a multivariate polynomial for the three tested ANN architectures. (a) Data to be learned; training points are colored in blue. (b) Relative error of the approximation for the ANN with R-exp(n) as the used transfer function. (c) Relative error of the approximation for the ANN with tanh as the used transfer function. (d) Relative error of the approximation for the ANN with exp-log as the used transfer function.
5.3. Double Pendulum Regression
Consider a double pendulum as shown in Fig. 5.6. Its kinematic state is determined by the angles θ1, θ2 and the corresponding angular momenta, in this case parameterized by the corresponding angular velocities w1 = dθ1/dt, w2 = dθ2/dt. Using Lagrangian mechanics, the angular accelerations a1 = dw1/dt, a2 = dw2/dt with only gravity acting as an external force on the system are determined to be

a1 = [ m2 l1 w1² sin(δθ) cos(δθ) + m2 g sin(θ2) cos(δθ) + m2 l2 w2² sin(δθ) − (m1 + m2) g sin(θ1) ] / D1    (5.3a)

a2 = [ −m2 l2 w2² sin(δθ) cos(δθ) + (m1 + m2) g sin(θ1) cos(δθ) − (m1 + m2) l1 w1² sin(δθ) − (m1 + m2) g sin(θ2) ] / D2    (5.3b)

with

δθ = θ2 − θ1,   D1 = (m1 + m2) l1 − m2 l1 cos²(δθ),   D2 = (m1 + m2) l2 − m2 l2 cos²(δθ)    (5.3c)

where g = 9.81 is standard gravity, l1 and l2 are the lengths of the pendulums, and m1, m2 are their respective masses. Given an initial state θ1(t = 0), θ2(t = 0), w1(t = 0), w2(t = 0), the trajectory of the double pendulum can be obtained by numerical integration of (5.3a) and (5.3b) (Levien and Tan, 1993).
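For illustration, (5.3a)–(5.3c) together with a simple explicit Euler step translate directly into NumPy. The default masses and lengths, the step size, and the choice of a forward Euler integrator are assumptions made only for this sketch; the thesis does not state which integrator or physical parameters were used.

    import numpy as np

    def accelerations(theta1, theta2, w1, w2, m1=1.0, m2=1.0, l1=1.0, l2=1.0, g=9.81):
        # Angular accelerations a1, a2 of the double pendulum, cf. (5.3a)-(5.3c).
        d = theta2 - theta1
        D1 = (m1 + m2) * l1 - m2 * l1 * np.cos(d) ** 2
        D2 = (m1 + m2) * l2 - m2 * l2 * np.cos(d) ** 2
        a1 = (m2 * l1 * w1**2 * np.sin(d) * np.cos(d)
              + m2 * g * np.sin(theta2) * np.cos(d)
              + m2 * l2 * w2**2 * np.sin(d)
              - (m1 + m2) * g * np.sin(theta1)) / D1
        a2 = (-m2 * l2 * w2**2 * np.sin(d) * np.cos(d)
              + (m1 + m2) * g * np.sin(theta1) * np.cos(d)
              - (m1 + m2) * l1 * w1**2 * np.sin(d)
              - (m1 + m2) * g * np.sin(theta2)) / D2
        return a1, a2

    def euler_step(state, dt=1e-3):
        # One explicit (forward) Euler step of the state (theta1, theta2, w1, w2).
        theta1, theta2, w1, w2 = state
        a1, a2 = accelerations(theta1, theta2, w1, w2)
        return (theta1 + dt * w1, theta2 + dt * w2, w1 + dt * a1, w2 + dt * a2)

    # Example: simulate 10 000 steps from an arbitrary initial state (angles in radians).
    state = (np.pi / 2, np.pi / 4, 0.0, 0.0)
    for _ in range(10_000):
        state = euler_step(state)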
We train an ANN to perform regression of the accelerations a1, a2 given the state of the double pendulum. We sampled 1800 states from a uniform distribution; 1600 samples were used for training and the rest as an independent test set. The employed ANN has three hidden layers using the real fractional exponential function and a sigmoid as an additional non-linear transfer function.
Training was performed using the following procedure. After random weight initialization using a zero-mean normal distribution, standard loss-minimization training using the Adam optimizer was performed while keeping ñ and n̂ constant at zero for a fixed number of iterations. Then, the constraint on ñ and n̂ was lifted and training was continued.
Figure 5.6.: Double pendulum, where l1 and l2 are the lengths of the pendulums, m1, m2 are their respective masses, and θ1, θ2 are the corresponding angles.
net structure                                  | training loss | test loss
1. exp(n) with free n after 200 000 iterations | 0.134         | 0.105
2. exp(n) with n = 0                           | 0.497         | 0.549
3. sigmoid                                     | 0.117         | 0.361

Table 5.4.: Results for learning the accelerations a1, a2 of a double pendulum with different net structures after 2 000 000 iterations.
For comparison, a copy of the neural net without lifting the constraints and a neural network using the standard sigmoid transfer function were also trained for the same number of iterations.
While the first network with free n outperforms the second network in generalization performance on the test set, both are exceeded by the standard sigmoid network (see Table 5.4). Since exp(n) is the identity function for n = 0, networks 2 and 3 should have equal performance; nevertheless, the bounds on ψ^-1 in Algorithm 2 limit the identity function to the interval [−10, 10] as well. This poses an implementation trade-off that seems to have a negative effect on performance for at least this regression problem.
6. Conclusion and Future Work
The objective of this thesis was to efficiently implement and evaluate a novel transfer function enabling smooth interpolation between summation and multiplication with fractional iterates of the exponential function for use in ANNs.
While real-valued fractional iterates of the exponential function can be evaluated efficiently with exact computations, the complex-valued version requires an approximation method in order to train ANNs in reasonable time. We have tested bi- and trilinear interpolation of exp(n). With similar results in terms of accuracy, interpolation in three dimensions is significantly faster than interpolation in two dimensions and has therefore been used in the subsequent experiments. However, this method requires almost 500 MB of memory on the GPU and can only be used on GPUs with sufficient space designated as texture memory. Future work therefore includes the examination of other approximation methods. One possibility is the use of interpolation for a certain number of iterations of the exponential function, thus avoiding both inaccuracy and efficiency issues for a high number of iterations. Better memory utilization in the interpolation of χ^-1 could possibly be achieved with an irregular interpolation grid, exploiting the fact that the range of c_n χ covers roughly the shape of an ellipse.
Experiments on different regression datasets have revealed that the proposed transfer function achieves comparable or sometimes slightly better results than conventional additive ANNs. Overall, adaptive optimizers have been shown to be superior to regular stochastic gradient descent in learning the operation of individual neurons. Specifically, the Adam optimization method outperforms the other optimizers for classification of handwritten digits and approximation of multinomials. Many of our observations thus far lack an explanation, suggesting further directions for future work. This includes the investigation of learning data with negative inputs, especially with real fractional iterates of the exponential function; in this scope, the choice of the cut-off parameters ψ_min and ψ_max (see Algorithm 2) has to be reevaluated. Furthermore, we noticed that parameter n sometimes grows beyond the interval [−1, 1]. It should be tested whether this behavior can be remedied by either adding a regularization term for parameter n to the loss currently used or restricting the change of n during learning by capping the update step.
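As a rough illustration only, both remedies could be combined in a plain SGD-style update such as the following sketch; every constant and name here is a placeholder rather than something used or tested in this thesis.

    import numpy as np

    def update_n(n, grad_n, lr=1e-2, reg=1e-3, max_step=0.01):
        # Illustrative SGD-style update for the per-neuron parameter n, combining the two
        # remedies mentioned above: an L2 penalty ((reg/2) * n^2 added to the loss) pulling n
        # towards zero, and a cap on the size of every update step.
        step = -lr * (grad_n + reg * n)
        step = np.clip(step, -max_step, max_step)
        return n + step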
Finally, the use of fractional functional iterates could be investigated for other functions used in ANNs. We believe that, given a continuously differentiable solution of either Abel's or Schröder's functional equation for a function such as f(z) = tanh(z), an ANN could learn the order of non-linearity of a neuron.
In conclusion, we think that further investigation of the proposed transfer function is necessary to issue a definitive judgement on its usefulness. Nevertheless, this method has the potential to advance the field of deep learning and thereby to contribute to future steps towards more sophisticated machine learning and general-purpose artificial intelligence, which in turn will have a significant impact on both science and society (Rudin and Wagstaff, 2013).
List of Figures

2.1. Multi-layer perceptron with one hidden layer
2.2. Commonly used non-linear transfer functions
2.3. Linear interpolation and extrapolation in one dimension
2.4. Hardware model with CUDA's SIMT architecture
2.5. Reverse-mode automatic differentiation in Theano by iterative application of the chain rule
3.1. A continuously differentiable solution ψ(x) for Abel's equation
3.2. Real-valued fractional exponential function exp(n)(x)
3.3. Domain coloring plot of χ(z), the solution of Schröder's functional equation
3.4. Samples of the complex-valued fractional exponential function exp(n)(x)
3.5. Possible structures for ANNs using the fractional exponential function
3.6. Example for continuous interpolation between summation and multiplication
4.1. Number of computational steps k involved in evaluation of the fractional exponential function
4.2. Reverse-mode automatic differentiation adapted to complex-valued functions
4.3. Transformation of a small absolute error δ around an input z to a function f
4.4. Development of the area covered by points in C0 by application of exp(n)
4.5. Relative errors for single and double precision
4.6. Relative errors for interpolation methods A and B
4.7. Relative error for exp(0.75) using methods A and B
5.1. Histograms for parameters ñ and n̂ within the hidden layer for different transfer function types
5.2. Test and training loss for a two-layer ANN learning a second-degree polynomial of two variables
5.3. Progression of parameter n in an ANN learning a multivariate polynomial
5.4. Test and training loss for approximation of a multivariate polynomial
5.5. Relative error for approximation of a multivariate polynomial for the three tested ANN architectures
5.6. Double pendulum
A.1. Domain coloring plot of f(z) = z
A.2. More examples for domain coloring plots
A.3. A box-and-whisker plot
List of Tables

4.1. Error estimation and numerically computed errors for n iterations
4.2. Quantitative results for the interpolation of exp(n) with method A
4.3. Quantitative results for the interpolation of exp(n) with method B
5.1. Results for using ANNs of different type for classification of handwritten digits
5.2. The best performing models for an ANN learning a multivariate polynomial with the real fractional exponential function
5.3. The best performing models for an ANN learning a multivariate polynomial with the complex fractional exponential function
5.4. Results for learning accelerations in a double pendulum
Bibliography
Agostinelli, F., M. Hoffman, P. Sadowski, and P. Baldi (2014). “Learning Activation
Functions to Improve Deep Neural Networks.” In: eprint: arXiv:1412.6830[cs.NE].
Bastien, F., P. Lamblin, R. Pascanu, J. Bergstra, I. J. Goodfellow, A. Bergeron, N. Bouchard,
D. Warde-Farley, and Y. Bengio (2012). “Theano: New Features and Speed Improvements.” In: eprint: arXiv:1211.5590[cs.SC].
Baydin, A. G., B. A. Pearlmutter, and A. A. Radul (2015). “Automatic Differentiation in
Machine Learning: A Survey.” In: eprint: arXiv:1502.05767[cs.SC].
Bayer, J. (2013). climin: optimization, straight-forward. url: http://climin.readthedocs.
org/en/latest/ (visited on 12/14/2015).
Bergstra, J., O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian,
D. Warde-Farley, and Y. Bengio (2010). “Theano: a CPU and GPU Math Expression
Compiler.” In: Proceedings of the 9th Python for Scientific Computing Conference (SciPy).
Ed. by S. van der Walt and J. Millman, pp. 3–10.
Bishop, C. M. (2013). Pattern Recognition and Machine Learning. Information Science and
Statistics. Springer.
Cheney, E. W. and D. R. Kincaid (2012). Numerical Mathematics and Computing. 7th ed.
Brooks Cole.
Droniou, A. and O. Sigaud (2013). “Gated Autoencoders with Tied Input Weights.” In:
Proceedings of the 30th International Conference on Machine Learning, ICML 2013. Vol. 28.
JMLR Proceedings, pp. 154–162.
Durbin, R. and D. E. Rumelhart (1989). “Product Units: A Computationally Powerful and Biologically Plausible Extension to Backpropagation Networks.” In: Neural
Computation 1.1, pp. 133–142.
Engelbrecht, A. P. and A. Ismail (1999). “Training Product Unit Neural Networks.” In:
Stability and Control: Theory and Applications 2, pp. 59–74.
Freitag, E. and R. Busam (2009). Complex Analysis. 2nd ed. Universitext. Springer.
Frigge, M., D. C. Hoaglin, and B. Iglewicz (1989). “Some Implementations of the
Boxplot.” In: The American Statistician 43.1, pp. 50–54.
Gabbiani, F., H. G. Krapp, N. Hatsopoulos, C.-H. Mo, C. Koch, and G. Laurent (2004).
“Multiplication and Stimulus Invariance in a Looming-Sensitive Neuron.” In: Journal
of Physiology-Paris 98.1, pp. 19–34.
Glorot, X. and Y. Bengio (2010). “Understanding the Difficulty of Training Deep Feedforward Neural Networks.” In: International Conference on Artificial Intelligence and
Statistics, pp. 249–256.
Griewank, A. (2003). “A Mathematical View of Automatic Differentiation.” In: Acta
Numerica 12, pp. 321–398.
He, K., X. Zhang, S. Ren, and J. Sun (2015). “Delving Deep into Rectifiers: Surpassing
Human-Level Performance on ImageNet Classification.” In: The IEEE International
Conference on Computer Vision (ICCV).
Higham, N. J. (2002). Accuracy and Stability of Numerical Algorithms. Society for Industrial and Applied Mathematics.
Hirose, A. (2012). Complex-Valued Neural Networks. 2nd ed. Studies in Computational
Intelligence 400. Springer.
Hornik, K., M. B. Stinchcombe, and H. White (1989). “Multilayer Feedforward Networks
are Universal Approximators.” In: Neural Networks 2.5, pp. 359–366.
Janson, D. J. and J. F. Frenzel (1993). “Training Product Unit Neural Networks with
Genetic Algorithms.” In: IEEE Expert 8, pp. 26–33.
Kingma, D. and J. Ba (2015). “Adam: A Method for Stochastic Optimization.” In: eprint:
arXiv:1412.6980[cs.LG].
Kneser, H. (1950). “Reelle analytische Lösungen der Gleichung . . . und verwandter
Funktionalgleichungen.” In: Journal für die reine und angewandte Mathematik 187,
pp. 56–67.
Kuczma, M., B. Choczewski, and R. Ger (1990). Iterative Functional Equations. Encyclopedia of Mathematics and its Applications. Cambridge University Press.
LeCun, Y., L. Bottou, Y. Bengio, and P. Haffner (1998). “Gradient-Based Learning
Applied to Document Recognition.” In: Proceedings of the IEEE 86.11, pp. 2278–2324.
Levien, R. and S. Tan (1993). “Double Pendulum: An Experiment in Chaos.” In: American
Journal of Physics 61, pp. 1038–1038.
Nvidia Corporation (2015a). CUDA C Programming Guide v7.5. url: https://docs.
nvidia.com/cuda/cuda-c-programming-guide/ (visited on 09/01/2015).
– (2015b). Parallel Thread Execution ISA v4.3. url: http://docs.nvidia.com/cuda/
parallel-thread-execution/ (visited on 09/01/2015).
Bevington, P. R. and D. K. Robinson (2003). Data Reduction and Error Analysis for the Physical Sciences. 3rd ed. McGraw-Hill.
Rudin, C. and K. L. Wagstaff (2013). “Machine Learning for Science and Society.” In:
Machine Learning 95.1, pp. 1–9.
Rumelhart, D. E., G. E. Hinton, and R. J. Williams (1986). “Learning Representations by
Back-Propagating Errors.” In: Nature 6088.323, pp. 533–536.
Schmidhuber, J. (2015). “Deep Learning in Neural Networks: An Overview.” In: Neural
Networks 61, pp. 85–117.
Schmitt, M. (2002). “On the Complexity of Computing and Learning with Multiplicative
Neural Networks.” In: Neural Computation 14.2, pp. 241–301.
Schnupp, J. W. H. and A. J. King (2001). “Neural Processing: The Logic of Multiplication
in Single Neurons.” In: Current Biology 11.16, R640–R642.
Sutskever, I., J. Martens, and G. E. Hinton (2011). “Generating Text with Recurrent Neural Networks.” In: Proceedings of the 28th International Conference on Machine Learning,
ICML 2011. Ed. by L. Getoor and T. Scheffer. Omnipress, pp. 1017–1024.
Tang, P. T. P. (1991). “Table-Lookup Algorithms for Elementary Functions and their
Error Analysis.” In: IEEE Symposium on Computer Arithmetic, pp. 232–236.
Tieleman, T. and G. Hinton (2012). Lecture 6.5 - rmsprop: Divide the gradient by a running
average of its recent magnitude. COURSERA: Neural Networks for Machine Learning.
Turner, A. J. and J. F. Miller (2014). “NeuroEvolution: Evolving Heterogeneous Artificial
Neural Networks.” In: Evolutionary Intelligence 7.3, pp. 135–154.
Urban, S. and P. van der Smagt (2015). “A Neural Transfer Function for a Smooth
and Differentiable Transition Between Additive and Multiplicative Interactions.” In:
eprint: arXiv:1503.05724[stat.ML].
Wegert, E. (2012). Visual Complex Functions: An Introduction with Phase Portraits. Springer.
Yao, X. (1999). “Evolving Artificial Neural Networks.” In: Proceedings of the IEEE 87.9,
pp. 1423–1447.
Zeiler, M. D. (2012). “ADADELTA: An Adaptive Learning Rate Method.” In: eprint:
arXiv:1212.5701[cs.LG].
A. Plot Techniques
A.1. Domain Coloring
Domain coloring refers to the visualization of complex-valued functions (Wegert, 2012, Ch. 2.5) based on the HSB color model. HSB stands for hue, saturation and brightness, and each value can vary between 0 and 1. The hue is used to represent the phase φ of z = re^(iφ), the saturation encodes the magnitude r, and for the brightness the real and imaginary parts are combined. The exact computation of the latter two involves taking the sine of either z or its real or imaginary part, leading to repetitive patterns.
Figure A.1.: Domain coloring plot of f(z) = z.
This becomes clearer when looking at the domain coloring plot of the identity function f(z) = z (see Fig. A.1). One can see the color depending on the angle as well as circles of varying saturation around the origin. The grid-like structure originates from the brightness being computed in terms of the real and imaginary parts as well as the total magnitude. More examples are displayed in Fig. A.2.
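A small matplotlib sketch of such a plot is given below. The exact formulas mapping the magnitude and the real and imaginary parts to saturation and brightness are illustrative choices and differ from the ones used for the figures in this thesis.

    import numpy as np
    import matplotlib.pyplot as plt
    from matplotlib.colors import hsv_to_rgb

    # Grid over the square [-4, 4] x [-4, 4] of the complex plane.
    x = np.linspace(-4.0, 4.0, 800)
    z = x[None, :] + 1j * x[:, None]
    f = z                                          # identity function, as in Fig. A.1

    h = (np.angle(f) + np.pi) / (2 * np.pi)        # hue from the phase
    s = 1.0 / (1.0 + 0.3 * np.abs(f))              # saturation decays with the magnitude
    v = 0.5 + 0.5 * np.abs(np.sin(np.pi * f.real) * np.sin(np.pi * f.imag))  # grid-like brightness

    plt.imshow(hsv_to_rgb(np.dstack([h, s, v])), extent=[-4, 4, -4, 4], origin="lower")
    plt.xlabel("Re z")
    plt.ylabel("Im z")
    plt.show()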
52
Im
44
Im
44
22
22
Im z
Im z
A. Plot Techniques
00
−2- 2
−4- 4
00
−2- 2
−4
-4
−2
-2
0
Re z
0
2
2
4
4
−4- 4
Re
(a) Domain coloring plot of f (z) = exp(z).
−- 44
−- 22
00
Re z
22
44
Re
(b) Domain coloring plot of f (z) = log(z) with
log : C → {z ∈ C : −1 ≤ Im z ≤ −1 + 2π }.
Figure A.2.: More examples for domain coloring plots.
A.2. Box-and-Whisker Plot
Box-and-whisker plots (Frigge, Hoaglin, and Iglewicz, 1989) are used to display some statistical measures of one-dimensional data, such as the relative error in Fig. 4.5. Among those statistical measures are the median, the first quartile Q1, and the third quartile Q3 (see Fig. A.3). The quartiles mark the boundaries of the box and are the result of partitioning the data into four sets of equal size. The whiskers are computed from the interquartile range (IQR).
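The quantities drawn in such a plot can be computed as in the following sketch; note that several conventions for estimating the quartiles exist (Frigge, Hoaglin, and Iglewicz, 1989), and NumPy's default is only one of them.

    import numpy as np

    def box_stats(samples):
        # Median, quartiles, whisker positions and outliers of a one-dimensional sample.
        samples = np.asarray(samples)
        q1, median, q3 = np.percentile(samples, [25, 50, 75])
        iqr = q3 - q1                                   # interquartile range
        lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # whisker positions
        outliers = samples[(samples < lower) | (samples > upper)]
        return median, q1, q3, lower, upper, outliers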
Figure A.3.: An exemplary box-and-whisker plot, annotated with the median, the first and third quartiles Q1 and Q3 marking the box, the whiskers at Q1 − 1.5 IQR and Q3 + 1.5 IQR with IQR = Q3 − Q1, and outliers beyond the whiskers.
www.kth.se