Point-to-Point Connectivity Between
Neuromorphic Chips using Address-Events
Kwabena A. Boahen
Abstract— I discuss connectivity between neuromorphic
chips, which use the timing of fixed-height, fixed-width,
pulses to encode information. Address-events—log2 (N )-bit
packets that uniquely identify one of N neurons—are used
to transmit these pulses in real-time on a random-access,
time-multiplexed, communication channel. Activity is assumed to consist of neuronal ensembles—spikes clustered
in space and in time. I quantify tradeoffs faced in allocating
bandwidth, granting access, and queuing, as well as throughput requirements, and conclude that an arbitered channel
design is the best choice. I implement the arbitered channel
with a formal design methodology for asynchronous digital
VLSI CMOS systems, after introducing the reader to this
top-down synthesis technique. Following the evolution of
three generations of designs, I show how the overhead of
arbitrating, and encoding and decoding, can be reduced in area (from N to √N) by organizing neurons into rows and columns, and reduced in time (from log2(N) to 2) by exploiting locality in the arbiter tree and in the row–column architecture, and clustered activity. Throughput is boosted
by pipelining and by reading spikes in parallel. Simple techniques that reduce crosstalk in these mixed analog–digital
systems are described.
Keywords— Spiking Neurons, Interchip Communication,
Asynchronous Logic Synthesis, Virtual Wiring.
I. Connectivity in Neuromorphic Systems
ENGINEERS are far from matching either the efficacy of neural computation or the efficiency of neural coding. Computers use a million times more energy per operation than brains do [1]. Video cameras use a thousand
times more bandwidth per bit of information than retinas
do (see Section II-A). We cannot replace damaged parts
of the nervous system because of these shortcomings. To
match nature’s computational performance and communication efficiency, we must co-optimize information processing and energy consumption.
A small—but growing—community of engineers is attempting to build autonomous sensorimotor systems that
match the efficacy and efficiency of their biological counterparts by recreating the function and structure of neural
systems in silicon. Taking a structure-to-function approach, these neuromorphic systems go beyond bioinspiration [2], copying biological organization as well as
function [3], [4], [5].
Neuromorphic engineers are using garden-variety VLSI
CMOS technology to achieve their goal [6]. This effort
is facilitated by similarities between VLSI hardware and
neural wetware. Both technologies:
• Provide millions of inexpensive, poorly-matched devices.
• Operate in the information-maximizing low-signal-to-noise/high-bandwidth regime.
And are challenged by these fundamental differences:
• Fan-ins and fan-outs are about ten in VLSI circuits versus several thousand in neural circuits.
• Most digital VLSI circuits are synchronized by an external clock, whereas neurons use the degree of coincidence in
their firing times to encode information.
Neuromorphic engineers have adopted time-division multiplexing to achieve massive connectivity, inspired by
its success in telecommunications [7] and computer networks [8]. The number of layers and pins offered by commercial microfabrication and chip-packaging technologies
is severely limited. Multiplexing leverages the 5-decade
difference in bandwidth between a neuron (hundreds of
Hz) and a digital bus (tens of megahertz), enabling us to
replace thousands of dedicated point-to-point connections
with a handful of high-speed metal wires and thousands of
switches (transistors). It pays off because transistors occupy less area than wires, and are becoming relatively more
compact in deep submicron processes.
In adapting existing networking solutions, neuromorphic
architects are challenged by huge differences between the
requirements of computer networks and those of neuromorphic systems. Whereas computer networks connect thousands of computers at the building- or campus-level, neuromorphic systems need to connect millions of neurons at
the chip- or circuit-board level. Hence, they must improve
the efficiency of traditional computer communication architectures, and protocols, by several orders of magnitude.
Mahowald and Sivilotti proposed using an address-event
representation to transmit pulses, or spikes, from an array
of neurons on one chip to the corresponding location in
an array on a second chip [9], [4], [10]. In their scheme,
depicted in Figure 1, an address-encoder generates a unique
binary address for each neuron whenever it spikes. A bus
transmits these addresses to the receiving chip, where an
address decoder selects the corresponding location.
Fig. 1. The Address-Event Representation
Pulses from spiking neurons are transmitted serially by broadcasting addresses on a digital bus. Multiplexing is transparent if the encoding, transmission, and decoding processes cycle in less than ∆/n seconds, where ∆ is the desired spike-timing precision and n is the maximum number of neurons that are active during this time. Adapted from [4].

Eight years after Mahowald and Sivilotti proposed it,
the address-event representation (AER) has emerged as the
leading candidate for communication between neuromorphic chips. Indeed, at the NSF Neuromorphic Engineering Workshop held in June/July 1997 at Telluride CO, the
AER Interchip Communication Workgroup was in the top
two—second only to Mindless Robots in popularity [11]!
The performance of the original point-to-point protocol has been greatly improved. Efficient hierarchical arbitration circuits have been developed to handle one- and
two-dimensional arrays [12], [13], [14]. Sender and receiver
interfaces have been combined on a single chip to build
a transceiver [15]. Support for multiple senders and receivers [16], [15], [17], one-dimensional nearest-neighbor–
connected network topologies [18], reprogrammable connections, and projective or receptive fields [19], [15], [17]
has been added. Laboratory prototypes with 20,000
neurons and 120,000 AER-based connections have been
demonstrated [19]. Systems with a million neurons and
a billion connections are on the drawing board. In the
near future, we are bound to see large-scale neuromorphic
systems that rewire themselves—just like neural systems
do—by taking advantage of the dynamically reprogrammable virtual wiring [20] made possible by AER.
In this paper, my goal is to provide a tutorial introduction to the design of AER–based interchip communication channels. The remainder of the paper is organized
as follows. I introduce a simple model of neural population activity in Section II, which I use to quantify tradeoffs
faced in communication channel design in Section III. This
section is divided into four subsections that cover bandwidth allocation (Section III-A), channel access protocols
(Section III-B), queuing (Section III-C), and throughput
requirements (Section III-D). Having motivated an approach to inter-chip communication, I introduce the reader
to a formal design methodology for asynchronous digital
VLSI CMOS systems, and describe an AER communication channel implemented using this methodology in Section IV. This section is divided into four subsections that
cover pipelining (Section IV-A), arbitration (Section IV-B), row–column organization (Section IV-C), and analog–
digital interfaces (Section IV-D). I review the performance
of three generations of designs in Section V and summarize
the paper in Section VI. Parts of this work have been described previously in conference proceedings [21], [13], in a
magazine article [22], and in a book chapter [14].
II. Neural Population Activity
Neuromorphic systems use the same currency of information exchange as the nervous system—fixed-height, fixed-width pulses that encode information in their time of occurrence. Timing precision is measured by latency and
temporal dispersion. Neuronal latency is the time interval between stimulus onset and spiking; it is inversely proportional to the strength of the stimulus. Neuronal temporal dispersion is due to variability between individual
neurons; it is also inversely proportional to the strength
of the stimulus. When messages are transmitted to reveal
locations—or identities—of neurons that are spiking, the
communication channel’s finite latency and temporal dispersion add systematic and stochastic offsets to spike times
that reduce timing precision.
Although a fairly general purpose implementation was
sought, our primary motivation for developing a communication channel is to read spike trains off neuromorphic
chips with thousands of spiking neurons, organized into
two-dimensional arrays—such as silicon retinas [23], [24]
or silicon cochleas [25], [26]. Neuronal activity is shaped
by the preprocessing that occurs in the sensory epithelium, which is designed to eliminate redundancy and encode information efficiently [27], [28]. We optimized the
channel design for the resulting neuronal population activity, and sought an efficient and robust implementation that
supports adaptive pixel-parallel quantization. This design
should be well-suited to higher-level neuromorphic processors insofar as they encode information efficiently.
A. Efficient Coding in the Retina
The retina converts spatiotemporal patterns of incident
light into spike trains. Transmitted over the optic nerve,
these discrete spikes are converted back into continuous signals by dendritic integration in postsynaptic targets. Retinal processing maximizes the information carried by these
spikes. Sampling at the Nyquist rate, conventional imagers
require 40Gb/s to match the eyes’ photopic range (17bits),
spatial resolution (60cycles/◦), temporal resolution (10Hz),
and field of view (2 × 90◦ × 90◦ ). In contrast, coding 2bits
of information per spike [29], the million-axon optic nerve
transmits just 40Mb/s—a thousand times less!
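These figures follow from simple arithmetic. The sketch below (Python) makes the comparison explicit; the 20 spikes/s mean firing rate is an assumption chosen to reproduce the 40Mb/s figure, not a number given in the text.

# Conventional imager matching the eye's photopic range, acuity, and field of view.
bits_per_sample = 17              # photopic range
samples_per_degree = 2 * 60       # Nyquist sampling of 60 cycles/degree
field_of_view = 2 * 90 * 90       # two eyes, 90 x 90 degrees each
frame_rate = 10                   # Hz, temporal resolution

camera_bps = field_of_view * samples_per_degree**2 * bits_per_sample * frame_rate
print(f"camera: {camera_bps/1e9:.0f} Gb/s")          # about 40 Gb/s

# Optic nerve: a million axons carrying about 2 bits per spike.
axons, bits_per_spike, mean_rate = 1_000_000, 2, 20  # mean_rate (spikes/s) is assumed
optic_bps = axons * bits_per_spike * mean_rate
print(f"optic nerve: {optic_bps/1e6:.0f} Mb/s, ratio: {camera_bps/optic_bps:.0f}x")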
The retina has evolved exquisitely adaptive filtering and
sampling mechanisms to improve coding efficiency, six of
which are highlighted below:
1. Local automatic gain control at the photoreceptor- [30]
and network-level [24], [31] eliminates the dependence
on lighting—the receptors respond to contrast instead.
Adapting locally extends the retina’s input dynamic range
without increasing its output range.
2. Bandpass spatiotemporal filtering in the outer plexiform
layer (or OPL, the retina’s first stage) [24] passes an intermediate range of spatial frequencies or temporal frequencies. Rejecting low frequencies reduces redundancy, and
rejecting high frequencies reduces noise.
3. Highpass temporal and spatial filtering in the inner plexiform layer (or IPL, the retina’s second stage) [31] suppresses the OPL’s strong low temporal-frequency response
at its peak spatial frequency (i.e., sustained response to static edge) and its strong low spatial-frequency response at
its peak temporal frequency (i.e., blurring of moving edge).
4. Half-wave rectification in on and off output cell
types [31] eliminates the elevated neurotransmitter-release
and spike-firing rates required to signal both positive and
negative signal excursions using a single channel. on/off
encoding is used in bipolar cells (the OPL–to–IPL relay
cells) as well as in ganglion cells (the retina’s output cells).
5. Phasic transient–sustained response in the ganglion
cells [32] avoids temporal aliasing by transmitting rapid
transients using brief spike-bursts, and eliminates redundant sampling by transmitting slow fluctuations using a low sustained firing rate. Figure 2 shows responses of silicon analogues of ganglion cells.
6. Foveated architecture, and precise, rapid eye movements, provide the illusion of high spatial and temporal resolution everywhere, while sampling coarsely in time centrally and coarsely in space peripherally [33].

Fig. 2. Adaptive Silicon Neuron's Step Response
(a) Spike-Frequency Adaptation. Top: The integrator's output current builds up each time the neuron spikes, modeling calcium-dependent potassium channels. Middle: The membrane voltage charges from reset (1.5V) to threshold (2.2V), driven by the difference between input and integrator currents. Bottom: Spikes are generated each time the membrane voltage reaches threshold. (b) Time-Constant Adaptation. The membrane voltage repolarizes rapidly because the integrator's output is temporarily shut off when the neuron is reset, modeling voltage-dependent potassium channels. Thus, a tight burst of spikes is generated and adaptation is rapid.
Since retinal neurons are driven by intermediate spatial
and temporal frequencies—and are insensitive to low spatial and temporal frequencies—small subpopulations tend
to fire together. Such correlated, but sparse, activity
arises because the neurons respond to well-defined object
features—and adapt to the background. There is also evidence that gap-junction coupling between ganglion cells
makes neighboring cells more likely to fire in synchrony
[34], [35], and these coincident spikes drive downstream
neurons more effectively [36]. I introduce the concept of
a neuronal ensemble in the next section to capture this
stimulus-driven, fine, spatiotemporal structure.
B. The Neuronal Ensemble
We can describe the activity of a neural population by
an ordered list of locations in spacetime,
E = {(x0; t0), (x1; t1), . . . , (xi; ti), . . .};   t0 < t1 < · · · < ti < · · · ,
where each coordinate specifies the occurrence of a spike
at a particular location, at a particular time. The same
location can occur in the list several times but a particular
time can occur only once—assuming time is measured with
infinite resolution.
Fig. 3. Adaptive Silicon Neuron’s Latency Distribution
The time taken to respond to a 15% step increase in input current
was measured 1000 times. (a) Spike-Frequency Adaptation: The first
spike is distributed more or less uniformly, with a slight tendency
toward shorter latencies. The median is 1.3ms and the firing rate immediately after the step (inferred from the longest latency of 2.63ms)
is 380Hz, compared with a steady-state firing rate of 38.4Hz. The bin
size was 33.3µs. (b) Time-Constant Adaptation: The distribution is
heavily skewed toward shorter latencies. The median is 40µs and the
peak firing rate is 28.1KHz, compared with a firing rate immediately
after the step of 714.3Hz and a steady state firing rate of 62.5Hz. The
bin size was 2µs.
There is no need to record time explicitly if the system
that is logging this activity operates on it in real-time—
only the location is recorded and time represents itself. In
that case, the representation is simply:
E = {x0 , x1 , . . . xi , . . .}; t0 < t1 < · · · ti < · · · .
This real-time code is called the address-event representation (AER) [9], [10].
E has a great deal of underlying structure that arises
from events occurring in the real world, to which the neurons are responding. The elements of E are clustered at
temporal locations where these events occur, and are clustered at spatial locations determined by the stimulus pattern. Information about stimulus timing and pattern can
therefore be obtained by extracting these clusters. Also,
E has an unstructured component that arises from noise
in the signal and in the system, and from differences in
gain and state among the neurons. This stochastic component limits the precision with which the neurons can encode
information about the stimulus. I call these statistically defined clusters neuronal ensembles.
The probability distributions that describe these neuronal ensembles may be determined by characterizing a
single neuron—assuming its state is randomized from trial
to trial, just like the state is randomized across the population. I measured how long it takes the adaptive silicon neuron to fire after I made a step change in its input current,
repeated over several trials (described in [5], [32]). The results obtained with spike-frequency adaptation and time-constant adaptation—implemented by modeling calcium- and voltage-dependent ion channels in real neurons—are
shown in Figure 3.
The median of the distribution may serve as a measure
of neuronal latency. It gives the expected latency if the target neuron’s threshold equals 50% of the spikes in the ensemble. Unlike the simple integrate-and-fire neuron, whose
latency is half its interspike interval, adaptive neurons have
latencies that are much shorter than their steady-state interspike interval, as shown in Figure 2. I define the ratio between the firing rate immediately after the step and
the firing rate in steady state as the frequency adaptation, γ [5]. The measurements shown in Figure 2a and
Figure 3a yield γ = 9.9 for a silicon neuron that models
calcium-dependent potassium channels. And I define the
ratio between the height of the peak in the distribution
and the height of the uniform distribution over the same
interval as the synchronicity, ξ [5]. The measurements
shown in Figure 3b yield ξ = 39.4 for a silicon neuron that
models voltage-dependent potassium channels as well as
calcium-dependent ones.
In addition to characterizing the neuron’s spike-timing
precision relative to its steady-state firing rate, frequency
adaptation and synchronicity allow us to compute its
throughput requirements. Frequency adaptation gives the
spike rate for neurons that are not part of the neuronal
ensemble—assuming that these neurons have adapted.
And synchronicity gives the peak spike rate for neurons
in the ensemble. Throughput must exceed the sum total
spike rate for these two segments of the population. I derive
formulae for computing channel capacity requirements, as
a function of tolerable percentage errors in spike rate and
neuronal latency, in the next section.
TABLE I
Time-Multiplexed Communication Channel Design Options

Spec        Approaches     Remarks
Encoding    Amplitude      Long settling time, static power
            Width          Capacity ∝ 1/width
            Code           Inefficient for precision < 6 bits
            Timing         Minimum-width, rail-to-rail
Latency     Polling        ∝ Number of neurons
            Event-driven   ∝ Active fraction
Integrity   Rejection      Collisions increase exponentially
            Arbitration    Queue events
Dispersion  Dumping        No waiting
            Queuing        ∝ 1/surplus-capacity
Capacity    Hard-wired     Simple ⇒ Short cycle time
            Pipelined      ∝ 1/slowest-stage
III. Tradeoffs in Channel Design

Given an information coding strategy, the communication channel designer faces several tradeoffs. Should he
preallocate the channel capacity, giving a fixed amount to
each user, or allocate capacity dynamically, matching each
user’s allocation to its current needs? Should she allow
users to transmit at will, or implement elaborate mechanisms to regulate access to the channel? And how does the
distribution of activity over time and over space impact
these choices? Can he assume that users act randomly, or
are there significant correlations between their activities?
I shed light on these questions in this section, and provide
some definitive answers.
Four important performance criteria for a communication channel that provides virtual point-to-point connections between neuronal arrays are:
Capacity: The maximum rate at which spikes can be
transmitted. It is equal to the reciprocal of the minimum
communication-cycle time.
Latency: The median of the distribution of time intervals
between spike generation in the sending population and
spike reception in the receiving population.
Temporal Dispersion: The standard deviation of the
latency distribution.
Integrity: The fraction of spikes that are delivered to
the correct destination.
All four criteria together determine the throughput,
which is defined as the usable fraction of the channel capacity, because the load offered to the channel must be
reduced to achieve more stringent specifications for latency,
temporal dispersion, and integrity.
Channel performance is affected by the information coding strategy used. Some alternatives to fixed-height, fixed-width pulses are listed in Table I, together with their pros
and cons. The choices made in this work are set in boldface. Murray and Tarassenko explore the use of various
pulse-stream representations to implement abstract models of neural networks [37], and Reyneri has analyzed and
compared the performance of various pulse coding strategies [38]. However, little attention has been paid to using precise spike timing and neuronal ensembles to encode information—despite increasing neurobiological evidence in support of such coding schemes [39], [40].
A. Allocation: Dynamic or Static?
We may use adaptive neurons that sample at fNyq when
the signal is changing, and sample at fNyq /Z when the
signal is static, where Z is a prespecified attenuation factor.
Let the probability that a given neuron samples at fNyq be
a. That is, a is the active fraction of the population.
Then, each quantizer generates bits at the rate
fbits = fNyq (a + (1 − a)/Z) log2 N,
because, for a fraction a of the time, it samples at fNyq; for the remaining fraction (1 − a) of the time, it samples at fNyq/Z.
Furthermore, log2 N bits are used to encode the neuron’s
location, using AER, where N is the number of neurons.
On the other hand, we may use conventional quantizers that sample every location at fNyq , and do not locally
adapt their sampling rate. In that case, there is no need to
encode location explicitly. We simply poll all N locations,
according to a fixed sequence, and infer the origin of each
sample from its temporal location. As the sampling rate is
constant, the bit-rate per quantizer is simply fNyq .
The multiple bits required to encode identity are offset
by the reduced sampling rates produced by local adaptation when activity is sparse. In fact, adaptive sampling
produces a lower bit rate than fixed sampling if
a < (Z/(Z − 1))(1/ log2 N − 1/Z).
For example, in a 64 × 64 array of neurons with sampling
rate attenuation Z = 40, the active fraction, a, must be
less than 6.1 percent.
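A quick numerical check of this break-even condition, as a Python sketch (the array size and attenuation factor are the ones quoted above):

from math import log2

N, Z = 64 * 64, 40
a_max = (Z / (Z - 1)) * (1 / log2(N) - 1 / Z)
print(f"adaptive sampling wins when the active fraction a < {a_max:.3f}")  # about 0.06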
Fig. 4. Effective Nyquist Sampling Rate versus Active Fraction
Plotted for various frequency adaptation factors (γ), with throughput
fixed at 10 spikes/sec/neuron. As the active fraction increases the
channel capacity must be shared by a larger number of neurons, and
hence the sampling rate decreases. It falls precipitously when the
active fraction equals the reciprocal of the adaptation factor.
It may be more important to minimize the number of
samples produced per second—instead of minimizing the
bit rate—as there are usually sufficient I/O pins to transmit all of the address bits in parallel. In that case, it is the
number of samples per second that is fixed by the channel
capacity. Given a certain fixed throughput, Fch , in samples
per second, we may compare the effective sampling rates,
fNyq , achieved by various sampling strategies.
Adaptive neurons allocate channel throughput dynamically in the ratio a : (1 − a)/Z between active and passive
fractions of the population. Hence
fNyq = fch/(a + (1 − a)/Z),     (1)
where fch ≡ Fch /N is the throughput per neuron. The average neuronal ensemble size determines the active fraction,
a, and frequency adaptation and synchronicity determine
the attenuation factor, Z, assuming neurons that are not
part of the ensemble have adapted. Figure 4 shows how
the sampling rate changes with the active fraction for various frequency adaptation factors, Z = γ. For small a and
Z > 1/a, the sampling rate may be increased by a factor
of at least 1/(2a).
In a retinomorphic system, spatiotemporal bandpass filtering and half-wave rectification make output activity
sparse [32], yielding active fractions of a few percent. Assuming a = 0.05, Z = 1 gives fNyq = fch for the integrate-and-fire neuron; Z = γ = 10 gives fNyq = 6.9fch for the
neuron with frequency adaptation; and Z = γξ = 450 gives
fNyq = 19.1fch when the membrane time-constant adapts
as well.
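These effective sampling rates come straight from Equation (1); a minimal sketch, assuming the same a = 0.05:

def f_nyq_over_f_ch(a, Z):
    # Equation (1): effective Nyquist rate per unit of per-neuron channel throughput.
    return 1.0 / (a + (1.0 - a) / Z)

a = 0.05
for Z in (1, 10, 450):   # integrate-and-fire, frequency adaptation, time-constant adaptation
    print(f"Z = {Z:3d}: fNyq = {f_nyq_over_f_ch(a, Z):.1f} x fch")
# prints about 1, 6.9, and 19 times fch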
B. Access: Arbitered or Unfettered?
Contention occurs if two or more neurons attempt to
transmit simultaneously when we provide random access
to the shared communication channel. We can simply detect and discard samples corrupted by collision [41]. Or
we can introduce an arbiter to resolve contention and a
queue to hold waiting neurons [9], [10]. Unfettered access
shortens the cycle time, but collisions increase rapidly as the load increases, whereas arbitration lengthens the cycle time, reducing the channel capacity, and queuing causes temporal dispersion, degrading timing information.

Fig. 5. Throughput versus Collision Probability
Throughput attains a maximum value of 18% when the collision probability is 0.64 and the load is 50%. Increasing the load beyond this level lowers throughput because collisions increase more rapidly than the load does.
Assuming the spiking neurons are described by independent, identically distributed, Poisson point processes, the
probability of k spikes being generated during a single communication cycle is given by
P(k, G) = G^k e^(−G)/k!,
where G is the expected number of spikes. G = Tch /Tspk ,
where Tch is the cycle time and Tspk is the mean interval
between spikes. By substituting 1/Fch for Tch , where Fch
is the channel capacity, and 1/(N fν) for Tspk, where fν is the mean spike rate per neuron and N is the number of neurons, we find that G = N fν/Fch. Hence, G is equal to
the offered load.
We may derive an expression for the collision
probability—a well-known result from communications
theory—using the probability distribution P (k, G) [8]. To
transmit a spike without a collision, the previous spike must
occur at least Tch seconds earlier, and the next spike must
occur at least Tch seconds later. Hence, spikes are forbidden in a 2Tch time interval, centered around the time that
transmission starts. Therefore, the probability of the spike
making it through is P (0, 2G) = e−2G , and the probability
of a collision is
pcol = 1 − P (0, 2G) = 1 − e−2G .
The unfettered channel must operate at high error rates
to maximize channel utilization. The throughput is S =
Ge−2G , since the probability of a successful transmission
(i.e., no collision) is e−2G . Throughput may be expressed
in terms of the collision probability:

S = ((1 − pcol)/2) ln(1/(1 − pcol));     (2)
this expression is plotted in Figure 5. The collision probability exceeds 0.1 when throughput reaches 5.3%. Indeed, the unfettered channel utilizes a maximum of only 18% of its capacity. Therefore, it offers higher transmission rates than the arbitered channel only if it is more than five times faster, since, as we shall show next, the arbitered channel operates fine at 95% capacity.
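As a numerical illustration of these operating points (a sketch, not part of the original analysis):

import math

def throughput(G):
    # S = G e^(-2G): usable fraction of capacity at offered load G
    return G * math.exp(-2 * G)

for G in (0.05, 0.25, 0.5, 1.0):
    p_col = 1 - math.exp(-2 * G)
    print(f"G = {G:.2f}: S = {throughput(G):.3f}, p_col = {p_col:.2f}")
# Throughput peaks near G = 0.5 at roughly 18% of capacity, by which point most
# cycles already suffer collisions.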
C. Latency: Queue new or Dump old?
What about the timing errors introduced by queuing in
the arbitered channel? For an offered load of 95 percent,
the collision probability is 0.85. Hence, collisions occur frequently and neurons are most likely to spend some time
in the queue. By expressing these timing errors as percentages of the neuronal latency and temporal dispersion,
we can quantify the tradeoff between queuing new spikes,
to avoid losing old spikes, versus dumping old spikes, to
preserve the timing of new spikes.
To find the latency and temporal dispersion introduced
by the queue, we use well-known results from queuing theory which give moments of the waiting time, wn , as a function of moments of the service time, xn [42]:
$\bar{w} = \dfrac{\lambda\,\overline{x^2}}{2(1 - G)}, \qquad \overline{w^2} = 2\,\bar{w}^2 + \dfrac{\lambda\,\overline{x^3}}{3(1 - G)},$
where λ is the arrival rate of spikes. These results hold
when spikes arrive according to a Poisson process. With
x = Tch and λ = G/Tch, the mean and the variance of the
cycles spent waiting are given by
$m \equiv \dfrac{\bar{w}}{T_{ch}} = \dfrac{G}{2(1 - G)}, \qquad \sigma_m^2 \equiv \dfrac{\overline{w^2} - \bar{w}^2}{T_{ch}^2} = m^2 + \dfrac{2m}{3}.$

We have assumed that the service time, x, always equals Tch, and therefore $\overline{x^n} = T_{ch}^n$.
We find that at 95-percent capacity, for example, a sample spends 9.5 cycles in the queue, on average. This result
agrees with intuition: As every twentieth slot is empty,
one must wait anywhere from 0 to 19 cycles to be serviced, which averages out to 9.5. Hence the latency is 10.5
cycles, including the additional cycle required for service.
The standard deviation is 9.8 cycles—virtually equal to the
latency. In general, this is the case whenever the latency
is much more than one cycle, resulting in a Poisson-like
distribution for the wait times.
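The numbers quoted above follow directly from these moments; a small sketch, using the same Poisson-arrival, fixed-service-time assumptions:

import math

def wait_cycles(G):
    # Mean and standard deviation of the wait, in channel cycles, at offered load G.
    m = G / (2 * (1 - G))
    sigma = math.sqrt(m * m + 2 * m / 3)
    return m, sigma

m, sigma = wait_cycles(0.95)
print(f"mean wait = {m:.1f} cycles, latency = {m + 1:.1f} cycles, std = {sigma:.1f} cycles")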
We can express the cycle-time, Tch , in terms of the neuronal latency, µ, by assuming that Tch is short enough to
transmit half the spikes in an ensemble in that time. That
is, if the ensemble has NE spikes and its latency is µ, the
cycle time must satisfy µ/Tch = (NE /2)(1/G), since 1/G
cycles are used to transmit each spike, on average, and half
of them must be transmitted in µ seconds. Using this relationship, we can express the wait time as a fraction of the
neuronal latency:
$e_\mu \equiv \dfrac{(m + 1)\,T_{ch}}{\mu} = \dfrac{G}{N_E}\,\dfrac{2 - G}{1 - G}.$
The timing error is inversely proportional to the number of
neurons because the channel capacity grows with population size. Therefore, the cycle time decreases, and there is
a proportionate decrease in queuing time—even when the number of cycles spent queuing remains the same.

Fig. 6. Throughput versus Normalized Channel Latency
Plotted for different neuronal ensemble sizes (NE). Higher throughput is achieved at the expense of latency because queue occupancy goes up as the load increases. These wait cycles become a smaller fraction of the neuronal latency as the population size increases, because the cycle time decreases proportionately.
Conversely, given a timing error specification, we can
invert our result to find out how heavily we can load the
channel. The throughput, S, will be equal to the offered
load, G, since every spike is transmitted eventually. Hence,
the throughput is related to channel latency and population
size by
$S = N_E\left(\dfrac{e_\mu}{2} + \dfrac{1}{N_E} - \sqrt{\left(\dfrac{e_\mu}{2}\right)^{2} + \dfrac{1}{N_E^{2}}}\,\right),$
when the channel capacity grows linearly with the number
of neurons. Figure 6 shows how the throughput changes
with the channel latency. It approaches 100% for large
timing errors and drops precipitously for low timing errors,
going below 95% when the normalized error becomes less
than 20/NE . As NE = aN , the error is 400/N if the active
fraction, a, is 5%. Therefore, the arbitered channel can
operate close to capacity with timing errors of a few percent
when population size exceeds several tens of thousands.
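The relationship between loading and timing error can be evaluated directly; a short sketch of the expression above, with values chosen to reproduce the 95%-at-20/NE operating point:

import math

def load(e_mu, N_E):
    # Offered load (= throughput) that produces a normalized latency error e_mu.
    x = e_mu * N_E / 2
    return x + 1 - math.sqrt(x * x + 1)

N_E = 1000
for err in (5 / N_E, 20 / N_E, 100 / N_E):
    print(f"e_mu = {err:.3f}: S = {load(err, N_E):.2f}")   # 0.81, 0.95, 0.99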
D. Predicting Throughput Requirements
Given a neuron’s firing rate immediately after a step
change in its input, fa , we can calculate the peak spike rate
of active neurons and add the firing rate of passive neurons
to obtain the maximum spike rate. Active neurons fire at a
peak rate of ξfa , where ξ is the synchronicity. And passive
neurons fire at fa /γ (assuming they have adapted), where
γ is the frequency adaptation. Hence, we have
Fmax = aN ξfa + (1 − a)N fa /γ,
where N is the total number of neurons and a is the active
fraction of the population, which form a neuronal ensemble.
We can express the maximum spike rate in terms of the
neuronal latency by assuming that spikes from the ensemble arrive at the peak rate. In this case, all aN neurons
will spike in the time interval 1/(ξfa ). Hence, the minimum latency is µmin = 1/(2ξfa ). Thus, we can rewrite our
expression for Fmax as:

$F_{max} = \dfrac{N}{2\mu_{min}}\left(a + \dfrac{1 - a}{\xi\gamma}\right).$
Intuitively, µmin is the neurons’ timing precision and N (a+
(1 − a)/(ξγ))/2 is the number of neurons that fire during
this time. The throughput must be equal to Fmax , and
there must be some surplus capacity to minimize collision
rates in the unfettered channel and minimize queuing time
in the arbitered one. This overhead is over 455% (i.e.,
(1 − 0.18)/0.18) for the unfettered channel, but only 5.3% (i.e., (1 − 0.95)/0.95) for the arbitered one.
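Plugging in the measurements from Section II-B gives a feel for the numbers involved; in this sketch the array size (64 × 64) is an illustrative choice, and the post-step firing rate of 380 Hz is taken from Figure 3a:

N, a = 64 * 64, 0.05            # neurons and active fraction (illustrative)
xi, gamma, f_a = 39.4, 9.9, 380.0

F_max = a * N * xi * f_a + (1 - a) * N * f_a / gamma   # peak spike rate, spikes/s
print(f"F_max = {F_max/1e6:.1f} Mspikes/s")

# Surplus capacity needed to hold each channel at its operating point:
print(f"unfettered: {(1 - 0.18)/0.18:.0%} overhead, arbitered: {(1 - 0.95)/0.95:.1%} overhead")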
In summary, arbitration is the best choice for neuromorphic systems whose activity is sparse in space and in time,
because we trade an exponential increase in collisions for a
linear increase in temporal dispersion. Furthermore, holding utilization constant (i.e., throughput expressed as a
percentage of the channel capacity), temporal dispersion
decreases as technology advances and we build larger networks with shorter cycle times, even though the collision
probability remains the same. The downside of arbitration
is that it takes up area and time, reducing the number of
neurons that can be integrated onto a chip and the maximum rate at which they can fire. Several effective strategies for reducing the overhead imposed by arbitration have
been developed; they are the subject of the next section.
IV. Arbitered Channel Design
The design of arbitered channels that support point-topoint connections among spiking neurons on different chips
is rather challenging. Early attempts were plagued by timing problems and crosstalk [9], [10]. Fortunately, significant progress has been made in asynchronous digital VLSI
systems in recent years, culminating in the design of a microprocessor that uses no clocks whatsoever by Martin’s
group at Caltech [43]. I apply Martin's rigorous, correct-by-construction design methodology to the arbitered channel, after introducing the program-based philosophy and notation it employs. Crosstalk, the pitfall of mixed analog–digital (MAD) system design, must also be addressed to
achieve reliable and robust operation.
Martin’s formal synthesis methodology enables us to design an asynchronous VLSI circuit by compiling a high-level
specification, written in the Communicating Hardware
Processes (CHP) language, into a Production Rule
Set (PRS) [46], [45], [47]. A production rule evaluates
a boolean expression in real time, and sets or clears a bit
when the expression becomes true; it is straightforward to
implement with MOS transistors. The synthesis procedure
involves two intermediate steps: program decomposition
and handshaking expansion.
Through Program Decomposition (PD), which involves decomposing the high-level specification into concurrent subprocesses, we:
• Reduce logical complexity by divide-and-conquer
• Share expensive hardware resources
At this level, we make architectural design decisions that
simplify the design and minimize its hardware requirements. We must synchronize concurrent subprocesses and resolve contention for shared resources.
TABLE II
CHP Language Constructs

Operation     Notation        Explanation
Process       Pi              A composition of communications
Guard         Gi              ≡ Bi → Pi. Execute Pi if Bi is true
Sequential    P1; P2          P1 ends before P2 starts
Overlapping   P1 • P2         P2 starts before P1 ends, or vice versa
Concurrent    P1 ∥ P2         There is no constraint
Repetition    ∗[P1; P2]       ≡ P1; P2; P1; P2; . . . Repeats forever
Selection     [G1 [] G2]      Execute Pi for which Bi is true
Arbitration   [G1 | G2]       Required if Bi not mutually exclusive
Input         A?x             Read data from port A to register x
Output        A!x             Write data from register x to port A
Probe         Ā               Is communication pending on port A?
Data type     x : int(m)      Location x is an m-bit register
Field         x.i             Register x's ith bit
Assignment    y := x          Copy data from x to y
Ports are used to input data, to output data, or simply
to synchronize—given that processes communicate when
they reach particular points in their programs. Communication is described in CHP simply by writing down the
name of the port, say S. This action may be composed
with other communications using the language constructs
outlined in Table II. A pair of complementary ports, one
active and the other passive, are connected to form a
channel, as shown in Figure 7a. Apart from complementary channel assignments, the only constraint on whether
a port can be active or passive is the probe, a primitive operation, denoted S̄, that a process invokes to check whether a communication is pending on its port S. It returns true
if there is one, and false otherwise. The probe only works
with a passive port, due to implementation constraints.
Through Hand-Shaking Expansion (HSE), which involves fleshing out each communication into a full four-phase handshake cycle, we:
• Choose whether to make a port active or passive
• Reshuffle a communication cycle’s four phases
At this level, we make logic design decisions that reduce
memory and improve speed. PD and HSE produce sequences of waits and actions that define concurrent subprocesses. These sequences are converted into PRS by writing a rule to perform each action when the preceding wait
becomes true.
The four-phase handshake is performed with a pair of
wires, as shown in Figure 7b, and specified using the HSE
primitives described in Table III. The active port initiates
the handshake by asserting the so-called request signal (i.e.,
r+). The probe is implemented by monitoring this signal
(i.e., [r])—an opportunistic implementation that works
only with a passive port. Data is assumed to be valid when
the request signal arrives, which requires their propagation
delays to be matched. The matched delays required by
this bundled-data protocol [44] can be avoided by using a
dual-rail representation, but this delay-insensitive scheme
requires two lines per data bit [45].
Fig. 7. Communication Channel Signals and Timing
(a) Data-bus (d0,...,d3) and handshake signals (r and a). (b) Timing Diagram. The sender initiates the sequence by driving its data onto the bus and taking r high. The receiver acknowledges by taking a high, after latching the data. These two parties perform complementary sequences of actions and waits: r+;[a];r-;[~a] for the active sender and [r];a+;[~r];a- for the passive receiver. The active port drives the so-called request line while the passive one drives the so-called acknowledge line.
We can describe an AER transmitter as follows. The
format used to specify a process in CHP is:
name(arguments) ≡ process(ports)
program
end
We name the transmitter process AEXMT(N ), giving the
number of neurons, N , as an argument. And assign it
N dataless ports, named Ln , to service N neurons, and a
single output port, named A, that writes (represented by
!) a ⌈log2(N)⌉-bit integer.¹ All this information is specified
in the header:
AEXMT(N) ≡
process(L1, L2, . . . , LN, A!int(⌈log2(N)⌉))
program
end
Now, we write a program that probes the L-ports to
detect communications initiated by neurons that are spiking, and arbitrates (represented by |) between them. It
then communicates on the chosen port and transmits its address; these operations may occur concurrently (represented by ∥). The code for this algorithm is:

∗[[ L̄1 → A!enc(1) ∥ L1 | · · · | L̄N → A!enc(N) ∥ LN ]]
A function, enc(n), which converts a one-hot code into a
binary one, is invoked to encode the chosen port’s address.
The inner brackets delimit arbitration while the outer ones
delimit repetition, together with the asterisk.
Similarly, we can describe an AER receiver in CHP as follows. The receiver uses a ⌈log2(N)⌉-bit input port, named
A, to read (represented by ?) address-events. And, it uses
N dataless ports, named Rn , to service N neurons. Thus,
we have:
AERCV(N) ≡
process(A?int(⌈log2(N)⌉), R1, R2, . . . , RN)
b : int(⌈log2(N)⌉)
c : int(N)
∗[A?b; c := dec(b); [c.1 → R1 [] · · · [] c.N → RN]]
end
¹ ⌈x⌉ gives the smallest integer larger than, or equal to, x.
TABLE III
HSE Primitives

Operation     Notation    Explanation
Signal        v           Voltage on a node
Complement    ~v          Inversion of v
And           v & w       High if both are high
Or            v | w       Low if both are low
Set           v+          Drive v high
Clear         v-          Drive v low
Wait          [v]         Wait till v is high
Sequential    [u];v+      ≡ u -> v+ in PRS
Concurrent    v+,w+       ≡ v+,w+ in PRS
Repetition    *[...]      Just like in CHP
A function, dec(b), which converts from binary to one-hot, is invoked to decode the address. b and c are local ⌈log2(N)⌉-bit and N-bit registers, respectively, used to store
the input and the result. The receiver communicates on the
port corresponding to the one set bit, c.i, in the one-hot
code, c. This port is chosen by selection (represented by
[]), which is used when there is no need for arbitration (i.e.,
the choice is unique). The inner brackets delimit selection
while the outer ones delimit repetition.
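Functionally, the pair of processes above amounts to binary encoding on the sender and one-hot decoding on the receiver. A minimal behavioral sketch in Python (enc and dec are the plain binary conversions the text names; arbitration and handshaking are modeled in later sections):

from math import ceil, log2

def enc(i, N):
    # Encode the index of the chosen port (1..N) as a ceil(log2(N))-bit address.
    return format(i - 1, f"0{ceil(log2(N))}b")

def dec(address, N):
    # Decode an address back into a one-hot register c; c[i] selects port R(i+1).
    chosen = int(address, 2)
    return [j == chosen for j in range(N)]

N = 8
for spiking in (3, 7, 1):                 # indices of neurons whose spikes are serviced
    a = enc(spiking, N)                   # AEXMT side: transmit the address-event
    c = dec(a, N)                         # AERCV side: select the matching R port
    print(spiking, a, c.index(True) + 1)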
A. Pipelining
Pipelining, a well-known approach to increasing throughput, reduces the time-overhead of arbitration by breaking
the communication cycle up into a sequence of smaller steps
that execute concurrently. Concurrency reduces the cycletime to the length of the longest step, with several addressevents in various stages of transmission at the same time,
as shown in Figure 8. Handshaking makes pipelining, and
queuing, straightforward: You can stall a pipeline stage—
or make a neuron wait—simply by refusing to acknowledge it. To become conversant with the synthesis procedure, let us design a handshake circuit to coordinate the
request and acknowledge signals of adjacent stages in a
pipeline and control data transfer.
A data-buffer pipeline-stage (also called a FIFO, for firstin, first-out) is described in CHP as:
LRBUF ≡
process(L?int(4), R!int(4))
b : int(4)
∗[L?b; R!b]
end
It reads in a nibble from its L port, turns around, and
writes out the nibble on its R port. We make L passive
and R active, which allows buffers to be cascaded, and label
these ports’ request and acknowledge signals as shown in
Figure 9a. Thus, we obtain the following HSE,
*[[li]; lo+; [~li]; lo-; ro+; [ri]; ro-; [~ri]],
simply by replacing each communication with a full four-phase handshaking sequence.
In the pipelined example shown in Figure 8b, the first
half of R occurs after the first half of L; the second half of
R remains at the end. That is:
*[[~ri & li]; lo+, ro+; [ri & ~li]; lo-, ro-]
Fig. 8. Pipelined Communication Cycle
(a) Communication cycle involving four-phase handshakes between sending neuron, arbiter, address-encoder, address-decoder, and receiving neuron. White and black boxes indicate the duration of the set and reset halves. Preceding or succeeding cycles are shown in dashed lines. (b) In the pipelined channel, we do not wait for the next stage to acknowledge us before we acknowledge the previous stage. Similarly, we do not wait for it to withdraw its acknowledge before we withdraw ours.
I have postponed [~ri] to the beginning of the next cycle, making R lazy-active, and merged adjacent waits and adjacent actions. This reshuffling is logically equivalent to our original CHP program (i.e., L; R), because both data exchange and synchronization occur during the communication's first half (the second half conveniently returns the signals to their original state).
The pipelined sequence operates as follows. When data arrives (i.e., [li]), we latch it and acknowledge receipt (lo+). However, we wait until the next stage has transmitted the previous item ([~ri]) so that we can pass on the new data (ro+) at the same time. We must keep data available until we get an acknowledge ([ri]). Then it is safe to make the latch transparent again and to withdraw our request (ro-). However, we wait for the previous stage to withdraw its request ([~li]) so that we can withdraw our acknowledge (lo-) at the same time.
We implement the pipelined sequence by writing production rules that perform actions in the sequence when
preceding waits become true:
~ri & li -> lo+,ro+,t-
ri & ~li -> lo-,ro-,t+
I have added a strobe signal, t, which makes the latch
opaque when low, and transparent when high. Sometimes,
we have to strengthen guards to prevent rules from misfiring
or interfering [45], [47], but this is not required here.
The handshake logic for our passive–active data buffer is
realized by a single gate, which acknowledges the previous
stage, sends a request to the next stage, and strobes the
latch, as shown in Figure 9b. The pull-down implements
the first rule and the pull-up implements the second one.
Setup and hold times may be satisfied by delaying the request, relative to the data, and by delaying withdrawing
the data after an acknowledge is received [44].
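Behaviorally, the C-element at the heart of this stage (described in Figure 9 below) is a state-holding gate; a minimal Python model, offered for intuition only:

class CElement:
    # Output follows the inputs only when they agree; otherwise the staticizer holds it.
    def __init__(self, out=False):
        self.out = out

    def update(self, a, b):
        if a and b:
            self.out = True        # both inputs high: drive output high
        elif not a and not b:
            self.out = False       # both inputs low: drive output low
        return self.out            # inputs disagree: hold previous state

c = CElement()
print([c.update(a, b) for a, b in [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]])
# -> [False, False, True, True, False]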
Fig. 9. Data-Buffer Pipeline Stage
(a) HSE Description: Each port is fleshed out into a pair of handshake lines and a set of data lines, and assigned an active or passive role. (b) Circuit Description: The data-buffer consists of a latch and a C-element—a gate whose output goes high when both inputs are high and goes low when both inputs are low. As its output is not always actively driven, a weak feedback inverter, called a staticizer, is added to hold state.

In the unpipelined example shown in Figure 8a, on the other hand, the first half of R occurs immediately after
[li]. That is:
*[[li]; ro+; [ri]; lo+; [~li]; ro-; [~ri]; lo-]
Communications intertwined in this way are specified by
the bullet (i.e., L • R) in CHP. This HSE is implemented
by the following PRS:
li -> ro+
ri -> lo+
~li -> ro-
~ri -> lo-

A pair of wires, connecting li to ro and ri to lo, suffices!
We have to give up this simplicity to gain the speed-up
offered by pipelining. Additional speed improvements may
be made by exploiting locality in the arbiter and in the
array, as shown in the next two subsections.
B. Arbitration
Arbitration may be performed by a recursive procedure:
1. Divide the neurons into two groups.
2. Choose one group, making sure there is an active neuron
in the group you choose.
3. If the chosen subgroup has more than one neuron, repeat
Steps 1 and 2 with this group.
4. Else, you are done.
Dividing by two balances the sizes of the subgroups, giving
neurons in each subgroup equal chances of being picked.
In CHP, our recursive arbitration procedure is described
by the recursive equation:
ARB(X) ≡ ARB(X/2) ∥ ARB(2) ∥ ARB(X − X/2),
where ARB(X) is an X-input arbiter process, which consists of three subprocesses that run concurrently. These
subprocesses are connected in a tree-like structure, as
shown in Figure 10a. The recursion unwinds at ARB(2)
or ARB(1), and hence we only need to design a two-input
arbiter cell—the one-input case is trivial. (N − 1) ARB(2)
cells, connected in a balanced binary tree with ⌊log2(N)⌋ levels, are needed to arbitrate between N neurons.²
² ⌊x⌋ gives the largest integer smaller than, or equal to, x.
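A behavioral sketch of this recursive procedure (Python; it models only which request a tree of two-input cells would grant, not the handshake timing or the nondeterministic choice a real flip-flop makes):

def arbitrate(pending):
    # Pick the index of one active request from `pending` (a list of booleans),
    # splitting the group in half at each level of the tree.
    n = len(pending)
    if n == 1:
        return 0 if pending[0] else None
    top, bottom = pending[:n // 2], pending[n // 2:]
    if any(top):                       # ARB(2): choose a subgroup with an active neuron
        return arbitrate(top)          # (deterministically here, for simplicity)
    if any(bottom):
        return n // 2 + arbitrate(bottom)
    return None

requests = [False] * 7
requests[5] = True
print(arbitrate(requests))             # -> 5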
Fig. 10. Recursively-Defined Arbiter
(a) An X-input arbiter is built from an X/2-input arbiter (arbt), an (X − X/2)-input arbiter (arbb), and a 2-input arbiter (arbc), connected as shown. The X/2- and (X − X/2)-input arbiters are themselves recursively defined by the same procedure. (b) Greedy Two-Input Arbiter Circuit. Requests, l1i and l2i, propagate down through a modified or gate (bottom), while acknowledges, ~ri, propagate up through a router (middle). A flip-flop (top) arbitrates between the requests and controls the router, which steers the active-low acknowledge to the chosen request by noring it with the flip-flop's active-low outputs (the source-switched pFETs filter metastable oscillations). A pair of nand gates invert active-high acknowledges from the steering circuit and block them when the outgoing active-high request, ro, is low.
In CHP, the two-input arbiter cell is described by:
ARB(2) ≡ process(L1 , L2 , R)
∗[[ L̄1 → R; L1; L1; R | L̄2 → R; L2; L2; R ]]
end
This process probes its L ports to determine if there are
active neurons in either subgroup. Next, it communicates
on its R port to ensure that the group of neurons it serves
has been chosen. And finally, it communicates on either of
its L ports to select an active subgroup. Thus, requests are
relayed up the tree by probing the R-to-L channels, while
selection is relayed down the tree by communicating on the
same channels. A second pair of L and R communications
terminates the selection. The cell at the top of the tree,
which serves all N neurons, is special, since its group is
always chosen. Thus, communications on its R port are
superfluous. We can connect its R port to a process that
automatically completes the communication (i.e., ∗[L]).
Making L1 and L2 passive, and R active, the synthesis
procedure yielded the circuit shown in Figure 10b. I implemented all paired communications using two halves of
a single four-phase communication. As is normally done, I
used a flip-flop (i.e., cross-coupled nand-gates) to guarantee mutual exclusion by anding one port’s active-high
request with the other’s active-low acknowledge [45].
Fig. 11. Layout of Recursively-Defined Arbiter
A 7-input arbiter, with address-encoder and control cells. It is built
up from a 3-input arbiter (first two cells) and a 4-input arbiter (last
three cells), which are connected together by an additional cell (third
cell), for a total of 6 cells. These 3- and 4-input arbiters are built from
2- and 1-input arbiters—the latter is just a pair of wires that bypass
the lowest level of the tree. An inverter at the top ties the active-high outgoing request back to the active-low incoming acknowledge.
Wells and selects have been omitted for clarity. Gray is substrate;
black and darker shading is M2. The pitch is 70λ and the height is
380λ, or 21µm × 114µm in 0.5µm technology (λ = 0.3µm).
The reshuffling I implemented works as follows. When
a request is received from the lower level (i.e., [l1i]), we
send it to the flip-flop (a1a+). And, without waiting for a
decision, we also relay it to the upper level (ro+). But we
make sure the upper level has cleared its acknowledge to the
previous request first ([~ri]). If not, we do not make a new
request. Instead, we accept the old acknowledge, assuming
it is stable (ro is high), and relay it to the lower level
(l1o+) as if it was a new acknowledge, once the arbitration
subprocess ([a1p]) acknowledges.
At this point, we are half way through the communication cycle, and every signal is activated. When the lower
level clears its request (i.e., [~l1i]), we clear our request to the flip-flop (a1a-). We wait for its acknowledge to clear ([~a1p]) before we clear our acknowledge to the lower level (l1o-), preventing a new incoming request from using an unstable ~a1p signal. However, we clear our request to the upper level (ro-) only if both incoming requests have been cleared, a strategy that allows our sister process to service her daughters with the old acknowledge.
The modified or-gate’s staticizer and the pull-ups of the
flip-flop’s nand gates must not be too weak. Otherwise,
slow transitions on the incoming request lines (i.e., l1i
or l2i) make the modified or gate's output oscillate (refer to Figure 10b). This happens when the incoming acknowledge (~ri) arrives before the outgoing request signal (ro) completes its transition,³ because the pull-up overcomes the staticizer when the active-low acknowledge disables the pull-down. In practice, this occurs only at the top cell, where the outgoing request is immediately fed back through an inverter. And, if the pull-ups in the flip-flop's nand gates are too weak, the router circuit loads the flip-flop, pulling the higher output down. Thus, it reduces the differential signal, causes both signals to creep downward, and produces nonmonotonic transitions.
³ Tim Horiuchi discovered this instability.
And, two conditions must be met to prevent a lingering
acknowledge from the flip-flop from servicing a new incoming request (refer to Figure 10b):
• ~a1p—not ~ri—fires l1o-. Hence, when l1o goes low, we know that ~a1p is high. This condition is easily satisfied, as the number of gates in these two paths differs a lot. The downward transition on l1i propagates through the flip-flop to drive ~a1p high, but propagates through the modified or-gate and three gates at the next level to drive ~ri high.
• l1o—not ro—fires ~l1o+. Hence, when ~l1o goes high, we know that l1o is low. This condition requires careful transistor sizing, as the number of gates in these two paths is identical. The downward transition on l1i propagates through the flip-flop and the nor gate to drive l1o low, but propagates through the modified or gate, and its inverter, to drive ro low.
Layout for the arbiter-tree is shown in Figure 11. This
layout was generated by implementing the recursive algorithm in a silicon compiler program, starting with layouts
for the two-input arbiter cell. The program was written
in C using the layout-editor’s (L-Edit) user-programmable
interface (UPI) and layout-compilation libraries (L-Comp)
(all from Tanner Research, Inc.). We now turn our attention to reducing the area-overhead of arbitration by tiling
neurons in two-dimensional arrays.
C. Row–Column Organization
By going to a hierarchical X-column–Y-row organization, as proposed in [9], [10], we reduce the number of two-input arbiter cells from Y × X − 1 to Y + X − 2. That is, it costs us nothing for the first row or column and one arbiter cell for each additional row or column. Hence, the area overhead scales like √N, where N is the number of neurons. The number of address-encoder and decoder cells is also
reduced by a similar amount—one per row or column, instead of one per neuron. Both sending and receiving neural
populations may be organized into two-dimensional arrays.
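For a concrete sense of the savings, a quick count, assuming a 64 × 64 array:

Y, X = 64, 64
print(Y * X - 1, (Y - 1) + (X - 1))   # 4095 two-input cells flat vs 126 with rows and columns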
C.1 Two-Dimensional Transmitter
Neurons in a two-dimensional AER transmitter are selected by performing hierarchical row-first–column-second
arbitration, as shown in Figure 12. First, we use a Y -input
arbiter to choose one of Y rows, and then we use an X-input arbiter to choose one of X neurons assigned to that row. Hierarchical arbitration guarantees that only one row is active at any time. Hence, we can share a single X-input column arbiter among all the rows. We must or
together all requests within each row to generate requests
for the row-arbiter, and all requests within each column to
generate requests for the column arbiter. We save time by
servicing all active neurons in the chosen row before we pick
another row [13], [14]. However, we should not wait for its
inactive neurons to communicate on the column lines. Only
neurons that were active at the time that the row was selected must be serviced. This way, inactive neurons cannot prolong completion indefinitely if they subsequently spike.

Fig. 12. Architecture of Address-Event Transmitter and Receiver
The sending neuron's interface circuit (shown in Figure 13a) communicates spikes to peripheral circuitry using row and column request–select lines. A row–column controller (Figure 13b) relays requests from a row or column of neurons to the arbiters and relays the arbiter's acknowledge back; it also activates the address encoder (Figure 15a). On the receiving end, a pipeline-stage (Figure 9b) reads and latches the address (X and Y). It acknowledges receipt right away and activates the address decoders (Figure 15b), which select the corresponding row and column. The receiving neuron (Figure 14) sends an acknowledge when both its row and column are selected. This signal is relayed to the pipeline-stage by a two-level wired-or circuit.
This strategy is realized by the neuron-interface and row–column-control circuits shown in Figure 13, designed by decomposing AEXMT(N) into row and column subprocesses, and following the synthesis procedure. The neuron drives the row-request line low (i.e., ~p-) when a spike occurs ([lix]). The controller relays this request to the row arbiter (ro+) and grants the request by driving the row-select line high (s+) when the arbiter acknowledges ([ri]). It also activates the row-address encoder (ao+). If necessary, the controller waits until previous column and encoder communications are completed ([~ai]). When the row is selected, all neurons with spikes place requests on their column lines (~cox-) and clear their spikes one by one—by taking ~lox low—as these requests are granted ([cix]).
Each neuron releases the row- and column-request lines when it is serviced. The row-request line, ~p, goes high only when all the spikes have been cleared. The row controller then withdraws its request to the arbiter (ro-), but it waits until it receives an acknowledge from the encoder ([ai]), since this signal prevents interference with ongoing communications. As soon as the arbiter clears its acknowledge ([~ri]), the controller withdraws its request to the encoder (ao-) and deselects the row (s-). I have strengthened the guard of ~lox- to ensure that a neuron is reset only when both its row and its column are selected.
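The transmit cycle is easier to follow as an ordered trace. The sketch below simply replays, in software, the sequence of transitions and guards named above for a single spiking neuron; it is one reading of the prose, not a circuit description.

#include <stdio.h>

int main(void) {
    /* One transmit cycle; square brackets mark guards that are waited on. */
    const char *cycle[] = {
        "neuron:     lix (spike occurs)",
        "neuron:     ~p-  pull the row-request line low",
        "controller: ro+  request the row arbiter",
        "controller: [ri] arbiter acknowledges; [~ai] previous cycle done",
        "controller: s+, ao+  select the row, start the row-address encoder",
        "neuron:     ~cox-  request its column",
        "neuron:     [cix] granted; ~lox-  spike cleared",
        "row:        ~p+  all spikes in the row cleared",
        "controller: [ai] encoder acknowledged; ro-  release the arbiter",
        "controller: [~ri] then ao-, s-  release encoder, deselect row",
    };
    for (unsigned i = 0; i < sizeof cycle / sizeof cycle[0]; i++)
        printf("%2u. %s\n", i + 1, cycle[i]);
    return 0;
}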
We can use the same control circuit shown in Figure 13b to interface a column of neurons with the arbiter and the encoder. The column logic itself consists of a Y-input wired-nor gate, which feeds ~cox into ~p, and Y aC-elements, which steer s to the correct neuron by anding it with the row-select. We can eliminate the aC-elements and broadcast the controller's acknowledge, since it is already anded with the row-select signal inside the neuron—provided we clear it before a new row is selected.
Fig. 13. Sending-Neuron and Row–Column Control Circuits
(a) Five-transistor interface (right half) between neuron (i.e., lix, ~lox) and row- (~p, s) or column- (~cox, cix) control circuits. The pull-downs on ~p and ~cox form row- and column-wide wired-nor gates; current-source pull-ups are at the edge of the array. The neuron is disabled when its row is selected (s is high) to prevent generation of new spikes. Capacitive positive feedback in the axon-hillock circuit (left half) provides hysteresis and speeds up transitions. (b) Interface among row or column (i.e., ~p, s), arbiter (ro, ~ri), and address encoder (ao, ~ai). These gates are called aC-elements (for asymmetric): their outputs are set when both inputs are high and cleared when a particular input is low (or vice versa).
To figure out whether a new row can be selected before the column-select is cleared, note that the row controller selects a new row after the row encoder clears its acknowledge (i.e., [~ai]). This signal is essentially synchronous with the column encoder's acknowledge, as the encoders simply relay the receiver's acknowledge. Furthermore, the column (and row) controller's select signals must be low in order for the encoders to clear their requests to the receiver. Hence, it follows that the column-select signal is cleared before a new row is selected.
Throughput may be boosted by reading the state of all
neurons in a selected row in parallel, and storing their
spikes in a latch on the periphery of the array, where they
can be rapidly relayed to the column arbiter. Stored spikes
are transmitted in a rapid burst, while the array is cycled to select and read the next row. I briefly describe
the performance enhancement achieved by this approach
in Section V; design details may be found in [48]. Let us
now turn our attention to organizing the receiving neurons
into rows and columns.
C.2 Two-Dimensional Receiver
The two-dimensional AER receiver's structure parallels that of the transmitter, as shown in Figure 12. First, we use a ⌈log2(Y)⌉-bit decoder to select one of the Y rows, and then we use a ⌈log2(X)⌉-bit decoder to select one of the X output ports assigned to that row.
This strategy is realized by the circuit shown in Figure 14, obtained by decomposing AERCV(N) into neuron, row, and column subprocesses, and following the synthesis procedure. I changed the gate that combines the row- and column-selects from a state-holding C-element to a purely combinational nand gate. Thus, we clear our request to the neuron when either the column-select or the row-select is cleared, without waiting for the other line to clear.
We run the risk of choosing the wrong neuron the next time around, when a lingering row-select signal is nanded with a new column-select signal—or vice versa. Matching the decoders' delays minimizes the risk. Asynchronous versions of traditional circuits used to encode and decode addresses are shown in Figure 15.
Fig. 14. Receiver's Neuron Interface Circuit
Active-high column- and row-select signals, axi and byi, are nanded together to generate an active-low request, ~ryxo. The neuron responds with an active-low acknowledge, ~ryxi. If desired, ~ryxo may be tied directly to ~ryxi to produce a minimum-width active-low pulse. Active-low acknowledges from all neurons in the same row are nanded together to generate an active-high row acknowledge, ryi. These row acknowledges are nored together to produce a single active-low acknowledge that is sent back to the decoders.
Our review of two-dimensional AER transmitter and receiver design is now complete. We have seen how to reduce the overhead imposed by arbitration, encoding, and decoding from N to √N by organizing neurons into rows and columns, and how to exploit this organization, together with locality, to reduce the average cycle time. As shown in Figure 12, and described in [13], [14], a pipeline-stage (see Figure 9b) can be inserted between the receiver's input port and its decoders to improve its performance. This slack allows the receiver to acknowledge as soon as it latches the address from the bus, and then decode the address and select the neuron while the sender is clearing its row- or column-select signals and selecting a new row or column. Having addressed the intricacies of asynchronous logic circuit design, we now turn our attention to the pitfalls of mixed analog-digital design.
D. Analog–Digital Interfaces
Neuromorphic chips are mixed analog-digital (MAD)
systems [49], where sending and receiving neurons serve
as analog-to-digital and digital-to-analog converters. They
use subthreshold analog CMOS circuits to model dendritic
computation [3] and asynchronous digital CMOS logic to
model axonal communication [3], [50]. One of the greatest
difficulties in their design is reducing crosstalk between the
analog and digital parts, given the gigantic differences in
current levels and speeds.
In the analog domain, we use 100pA currents and 100fF capacitors to achieve 1V/ms slew rates, whereas in the digital domain, 100µA currents and 100fF capacitors yield 1V/ns slew rates—a million times higher! To match these slew rates, the neuron's gain must exceed one million. And to ensure that less than 5mV of the 5V digital swing finds its way into the analog circuitry, parasitic coupling capacitances must be less than 0.1fF! Recalling that a CMOS inverter's gain is about 10 and a minimum-sized transistor's drain-to-gate overlap capacitance exceeds 1fF, you realize how demanding these specifications are.
Fig. 15. Address Encoder and Decoder Circuits
(a) A 1-in-m one-hot to n-bit binary encoder, where n = ⌈log2(m)⌉, is built with m × (n + 1) one-transistor cells. A cell either pulls up the output line with a pFET, driven by an active-low input line (i.e., ~ami), or pulls it low with an nFET, driven by an active-high input line (ami). An extra output line (bi) that is always pulled high provides an active-high request signal. (b) An n-bit to 1-in-m decoder, where m < 2^n, is built with (n + 1) × m two-transistor cells. The n + 1 cells connected to each active-low output (i.e., ~dmo) form an (n + 1)-input nand gate, with n + 1 series-connected nFETs and n + 1 parallel-connected pFETs. We decode a zero or a one by driving the cell with either an active-high signal (ani) or an active-low signal (~ani), respectively. The active-high request signal is connected to the (n + 1)th input. Adapted from [10].
To realize a millionfold gain, we use a two-inverter noninverting amplifier with positive feedback, also known as
the axon-hillock circuit. This circuit, shown in Figure 13a,
is named after the spike-initiation zone in a biological neuron [3]. If the loop gain exceeds unity, the output’s rate of
change is limited only by the amplifier’s output current—
not by the input current. Thus, this circuit has an effective
gain of 100,000 or more! Unfortunately, the first inverter,
with its input charging up at 1V/ms, spends 0.5ms within
0.5V of threshold, passing a short-circuit current close to
100µA the whole time. Hence, it consumes a million times
more power than a regular CMOS inverter.
We may limit the axon-hillock’s power dissipation by
starving the first inverter using an nMOS-style pull-up
transistor, which supplies a fixed bias current of about 1µA,
as suggested by Lazzaro [51]. It is unsafe to reduce the current further because this inverter’s output must switch all
the way to Gnd by the time the row is selected (see Figure 13a). Otherwise, the second inverter’s pull-down transistor will clear lix when its pull-up is disabled by s going
high.4 Consequently, this approach reduces the power dissipation to only 10,000 times that of a regular inverter.
4 Charles Higgins discovered this race condition.
Fig. 16. Bad and Good Receiver Pixels
Both pixels use a diode-capacitor circuit to integrate spikes and a
tilted current-mirror to amplify current, as described in [32]. The
bias voltage Vw sets the amount of charge metered onto the capacitor
each time the pixel is selected. The parasitic capacitors shown can
pump or inject current into the integrator, as explained in the text.
In (b), the pull-up isolates the integrator from these parasitic effects.
Power-supply rails mediate crosstalk. Transistors connected to the rails form a multiple-input differential pair,
and a device transiently steals current from the others when
it is switched on. With the axon-hillock’s input and threshold transistors tied to Gnd and Vdd, respectively, these rails
mediate inhibitory and excitatory interactions, respectively
(see Figure 13a). We can avoid turning off the current in
these transistors by limiting the reset current and turning it off as soon as the spike is reset, as described in [3].
Hence, we can isolate the neuron from the digital circuitry
by moving this inverter to the analog supply, without corrupting the analog supply. However, we must also move the
second inverter’s pull-down to the analog supply to avoid
injecting digital supply noise through the positive-feedback
capacitor (via the second inverter’s pull-down device, which
remains on during the interspike interval). The inhibition mediated by this pull-down is acceptable, as it tends to desynchronize the neurons, unlike excitation, which tends to synchronize them and increase spiking activity.
Parasitic capacitances within a device, due to overlap between gate and source/drain diffusion, can turn on a device
by driving its source outside the supply rail. This problem plagued the first receiver pixel that I designed (shown
in Figure 16a,). Rapid voltage swings on the columnselect line (axi) are transmitted to the source terminal of
the current-source transistor (device with gate tied to Vw),
driving it a fraction of a volt below GND—if the node’s voltage sits close to GND, as it does in this circuit. As a result,
the current source would pass a fraction of a picoamp even
when Vw was tied to GND.
Parasitic capacitances between series-connected devices can produce charge pumping. This problem also plagued the receiver pixel shown in Figure 16a. The pair of transistors controlled by the row- and column-select lines, byi and axi, pump charge from the current-source transistor to ground when non-overlapping pulses occur on the select lines. For a 20fF parasitic capacitor, a 100Hz average spike rate per pixel, a 0.5V voltage drop, and 64 neurons per row or column, the current is 64pA. This current, which scales with the array size, easily swamps out the subpicoamp current levels we must maintain in the diode-capacitor integrator to obtain time constants greater than 10ms using a 300fF capacitor.
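Recomputing the 64pA figure (values from the text; the factor of 64 reflects my reading that each pixel's parasitic sees select pulses from all 64 neurons sharing its line):

#include <stdio.h>

int main(void) {
    double Cp   = 20e-15;    /* parasitic capacitor, 20 fF                  */
    double dV   = 0.5;       /* voltage swing pumped per pulse, 0.5 V       */
    double rate = 100.0;     /* average spike rate per pixel, 100 Hz        */
    int    n    = 64;        /* neurons per row or column                   */
    double I = Cp * dV * rate * n;        /* charge pumped per second       */
    printf("pump current = %g pA\n", I * 1e12);    /* prints 64 pA          */
    return 0;
}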
TABLE IV
Three Generations of Arbitered Channels

  Size       Process   Read-Out   Cycle-Time   Throughput
  64 × 64    2.0µm     Random     2µs          500KS/s [4]
  64 × 64    2.0µm     Local      420-730ns    2.0MS/s [13]
  104 × 96   1.2µm     Parallel   30-400ns     25MS/s [48]
Fig. 17. Layout of 2 × 2 Receiver Pixels
Pixels are flipped vertically and horizontally to isolate digital and analog circuitry and share contacts (see Figure 16b for the circuit). Current-mirror integrators are located centrally, with switched current-sources (large devices at top and bottom edges), nand-gate pull-downs (two small series-connected devices tied to the current-sources), and nand-gate and acknowledge pull-ups (devices with L-shaped gates near left and right edges) on the periphery. axi, ryi, VA, and Iout run vertically in M1; byi runs horizontally in M2; Vw, Vpu, and Sel run horizontally in Poly1. A second Poly2 select line is used in odd columns to compensate for hexagonal tiling in the retina chip. An M1 line, tied to Vdd, shields the current-source transistor's drain from the M2 row-select line. Gray is substrate; black and darker shading is M2. The cell width is 63λ and its height is 46λ, or 18.9µm × 13.8µm in 0.5µm technology (λ = 0.3µm).
Capacitive turn-on and charge-pumping can both be
eliminated by adding a pull-up, which implements an
nMOS-style nand gate, as shown in Figure 16b. A
full CMOS nand gate will also work—it eliminates the
global bias line Vpu but requires an additional transistor.
The pull-up keeps the current-source transistor's source terminal close to Vdd, making it impossible to capacitively drive it below ground. And it is biased to supply a few microamps, easily overwhelming the pump current. Furthermore, the current-source transistor is switched on by swinging its source terminal from Vdd to GND, a technique that
can meter minute quantities of charge, as demonstrated by
Cauwenberghs [52]. A layout of the receiver pixel is shown
in Figure 17. In the next section, I discuss characterization
procedures for AER communication channels and review
the performance of some existing designs.
V. Test Protocols and Channel Performance
Timing relationships between the control signals must
be kept in mind while debugging these asynchronous interfaces. The diagram in Figure 7b indicates which party must
act at any stage in the sequence, helping us to determine
who is at fault when the channel hangs. For example, if it
hangs with both r and a high, the arrow indicates that the
sender is at fault—it failed to withdraw its request. Testing is facilitated by interfacing sender and receiver chips with a computer that can read and write addresses at high speed.
Fig. 18. Address-Event Streams Showing Arbiter Scanning
Y and X addresses are plotted on the vertical axes, and their position in the stream is plotted on the horizontal axes. For a load at 5% of capacity: (a) Y addresses tend to increase with sequence number. (b) X addresses are distributed randomly.
Delbrück et al. have implemented a MatLab-based interface on the Mac, using a parallel I/O card from National Instruments [53]. They achieved a transfer rate of 100KHz by programming at the register level.
The architectural optimizations described earlier reduced cycle times by more than an order of magnitude over three generations of arbitered AER channel designs, going from the 2µs reported in Mahowald and Sivilotti's pioneering work to as low as 30ns reported in [48], where spikes are read out from the array in parallel. Table IV summarizes the evolution.
Address-event streams from the local-readout 64 × 64–
neuron transmitter design [14], fabricated in 2µm technology, reveal the arbiters’ greedy behavior. This transmitter uses the architecture shown in Figure 12, and reads
all the spiking neurons in a selected row, sequentially, before it selects another row. The row arbiter rearranges the Y addresses, as shown in Figure 18, as it attempts to span the smallest subtree, going to the nearest row that is active. Such scanning is beneficial because traversing an additional level added an estimated 37ns to the cycle time.
Scanning is not evident in the X addresses because, at the
low load level used, no more than 3 or 4 neurons are active
simultaneously within the same row.
Cycle-time measurements from a parallel-readout 104 ×
96–neuron transmitter [48], fabricated in 1.2µm CMOS
technology, are shown in Figure 19. This transmitter used
an architecture similar to the previous one, except that a
row-wide latch was interposed between the array and the column arbiter, and the state of all neurons in a selected row was read in parallel. The latch's bit-cells act as slave
neurons: they send requests to the column arbiter while
the neurons in the selected row are reset and another row
is selected. The cycle-time is as low as 30ns when spikes
Fig. 19. Cycle-Times for Parallel-Readout Transmitter
Screen dump from a Tektronix TLA704 Logic Analyzer. Top trace
- request, middle - column address, and bottom - row address. The
first and second address-events are from different rows, whereas the
second and third events are from the same row. The cycle time is
362ns in the first case and 72ns in the second case.
are read from the latch, and goes up to 400ns when data is
read from the array. Even this worst case cycle-time, which
involves arbitration in both dimensions, trumps the 730ns
achieved with the earlier local-readout design [13], [14].
Images in Figure 20 show the response of the 104 × 96-neuron retinomorphic chip to a light spot [31]. The
address-events were read into a computer, and the images were rendered by modeling temporal integration in
the diode-capacitor integrator. The four types of ganglion
cells on the chip subsample the image by a factor of two;
LSBs of the X and Y addresses encode the cell type.
VI. Discussion and Summary
I have described the design of communication channels
for neuromorphic chips. These designs exploit spatially
sparse, temporally coincident, neural activity, whitened by
preprocessing in the sensory epithelium. Neuronal ensembles capture this stimulus-driven spatiotemporal activity.
They consist of spikes clustered at distinct temporal locations where events occur and sparse spatial locations determined by the stimulus pattern. They also have an unstructured component that arises from noise in the signal,
from noise in the system, and from differences in gain and
state among the neurons. This stochastic component limits
the precision with which neurons encode information about
the stimulus. Neuronal ensembles can be transmitted by
time-division multiplexing without losing information if the
channel’s timing precision exceeds that of the neurons.
A. Design Tradeoffs
For a random-access, time-multiplexed, channel, the
multiple bits required to encode identity are offset by the
reduced sampling rates produced by local adaptation when
activity is sparse. The pay-off is even better when there are
sufficient I/O pins to transmit all the address-bits in parallel. In this case, frequency and time-constant adaptation
allocate bandwidth dynamically in the ratio a : (1 − a)/Z
between active and passive fractions of the population. For
low active fractions, a, and sampling-rate attenuation factors, Z, larger than 1/a, the effective Nyquist sampling rate may be increased by a factor of 1/(2a).
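For concreteness, with made-up numbers rather than measurements, take an active fraction a = 5% and an attenuation factor Z = 40, which exceeds 1/a:

#include <stdio.h>

int main(void) {
    double a = 0.05;                  /* active fraction (illustrative)      */
    double Z = 40.0;                  /* sampling-rate attenuation, > 1/a    */
    double active  = a;               /* bandwidth share of the active cells */
    double passive = (1.0 - a) / Z;   /* share of the passive fraction       */
    printf("allocation ratio a : (1-a)/Z = %.3f : %.3f\n", active, passive);
    printf("effective Nyquist-rate gain  = %.1f\n", 1.0 / (2.0 * a));  /* 10 */
    return 0;
}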
Contention occurs when two or more neurons spike simultaneously, and we must dump old spikes to preserve the timing of new spikes or queue new spikes to avoid losing old spikes.
Fig. 20. Response of Retinomorphic Chip
Four ganglion-cell types respond to light or dark spots, either in a
transient (increasing, decreasing) or sustained fashion (on, off).
(a) Light spot stationary (located where the single active Increasing cell is): Sustained cells pick up increased signal at the spot's location and decreased signal in the surrounding region, due to lateral inhibition.
Fixed-pattern noise, due to transistor mismatch, is also evident. (b)
Light spot moving up and to the right: Transient cells pick up decrease at inhibitory surround’s leading edge and excitatory center’s
leading edge. The mean spike rate was 5 spikes/neuron/sec.
An unfettered design, which discards spikes clobbered by collisions, offers higher throughput if high spike-loss rates are tolerable. In contrast, an arbitered design,
which makes neurons wait their turn, offers higher throughput when low-spike loss rates are desired. Indeed, the unfettered channel utilizes only 18% of its capacity, at the
most. Therefore, the arbitered design offers more throughput if its cycle time is no more than five times longer than
that of the unfettered channel.
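The 18% ceiling is the classic peak utilization of an unslotted ALOHA channel, 1/(2e), and the factor of five is essentially its reciprocal; a quick check:

#include <math.h>
#include <stdio.h>

int main(void) {
    double peak = 1.0 / (2.0 * exp(1.0));  /* unslotted ALOHA peak utilization */
    printf("peak utilization         = %.1f%%\n", 100.0 * peak);   /* ~18.4% */
    printf("allowed cycle-time ratio = %.1f\n", 1.0 / peak);       /* ~5.4   */
    return 0;
}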
The inefficiency of the unfettered channel design, also known as ALOHA, has long been recognized, and more efficient protocols have been developed [7]. One popular
approach is CSMA (carrier sense, multiple access), where
each user monitors the channel and does not transmit if
it is busy.5 This channel is prone to collisions only during the time it takes to update its state. Hence, the collision rate drops if the round trip delay is much shorter
than the packet-transmission time—as in bit-serial transmission of several bytes. Its performance is no better than
ALOHA’s, however, if the round trip delay is comparable to the packet-transmission time [7]—as in bit-parallel
transmission of one or two bytes. Consequently, it is unlikely that CSMA will prove useful for neuromorphic systems (preliminary results are reported in [54]).
As technology improves and we build denser arrays with
shorter cycle times, the unfettered channel’s collision probability remains unchanged for the same normalized load,
whereas the arbitered channel’s normalized timing error
decreases. This desirable scaling arises because timing error is the product of the number of wait cycles and the
cycle-time. Consequently, queuing time decreases due to
the shorter cycle times, even though the number of cycles
spent waiting remains the same. Indeed, as the cycle-time
must be inversely proportional to the number of neurons,
N , the normalized timing error is less than 400/N for loads
5 The Ethernet, and most local-area networks, work this way.
below 95% of capacity and active fractions above 5%. For population sizes of several tens of thousands, the timing error is just a few percentage points.
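As an illustrative instance of the 400/N bound (the population size is made up):

#include <stdio.h>

int main(void) {
    int N = 40000;                      /* population size (illustrative)    */
    double bound = 400.0 / N;           /* normalized timing-error bound     */
    printf("timing error < %.1f%%\n", 100.0 * bound);    /* prints 1.0%      */
    return 0;
}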
For neurons whose timing precision is much better than
their interspike interval, we may estimate throughput requirements by measuring frequency adaptation and synchronicity. Frequency adaptation, γ, gives the spike rate
for neurons that are not part of the neuronal ensemble.
And synchronicity, ξ, gives the peak spike rate for neurons in the ensemble. These firing rates are obtained from
the spike frequency at stimulus onset by dividing by γ and
multiplying by ξ, respectively. The throughput must exceed the sum of these two rates if we wish to transmit the
ensemble without adding latency or temporal dispersion.
The surplus capacity must be at least 455% to account
for collisions in the unfettered channel, but may be as low
as 5.3% in the arbitered channel, with subpercent timing
errors due to queuing.
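Where the 455% and 5.3% figures come from, under my reading that the surplus is 1/utilization − 1, evaluated at the 18% unfettered ceiling and at a 95% arbitered load (small rounding differences from the quoted 455% are expected):

#include <stdio.h>

int main(void) {
    double u_unfettered = 0.18;   /* peak utilization of the unfettered channel */
    double u_arbitered  = 0.95;   /* usable load of the arbitered channel       */
    printf("unfettered surplus: %.0f%%\n", 100.0 * (1.0 / u_unfettered - 1.0));
    printf("arbitered surplus:  %.1f%%\n", 100.0 * (1.0 / u_arbitered  - 1.0));
    return 0;
}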
B. Asynchronous Implementation
N − 1 two-input arbiter cells, arranged in a tree with ⌊log2(N)⌋ levels, are required to arbitrate between N neurons, plus N address-decoder and encoder cells. This area overhead may be reduced from N to √N by going to a hierarchical row–column organization. Time overhead is
reduced by adopting three strategies.
Pipelining reduces the time-overhead of arbitration by
overlapping communication sequences of the sending neuron, the row-arbiter, the column-arbiter, and the receiving
neuron. We also inserted a pipeline-stage between the receiver chip’s input port and its decoders, allowing it to
acknowledge as soon as it latches the address from the bus. It
decodes the address and selects the target neuron, while
the sender is clearing its row or column select signals and
selecting a new row or column.
Exploiting locality in the row–column organization reduces time-overhead further by servicing all active neurons
in the selected row, redoing row arbitration only when no
requests are left. Throughput is boosted further by reading
the state of all neurons in a selected row in parallel, storing
this information in a latch at the periphery of the array,
where it can be readily supplied to the column arbiter. The
spikes are transmitted in a rapid burst, while the array is
cycled to select and read the next row.
Exploiting locality in the arbiter tree also reduces timeoverhead by spanning the smallest subtree that has a pair
of active inputs. We implemented this strategy simply by
making the two-input arbiter cell greedy—it services both
of its daughters if they are active. Thus, the arbiter becomes a scanner when neuronal activity is spatially clustered. It traverses 1 level with probability 1/2, 2 levels with probability 1/4, 3 with probability 1/8, and so on—this series converges to an average of 2 levels.
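That figure is the expectation of the series: the arbiter traverses k levels with probability 1/2^k, and the sum of k/2^k over all k is 2. A quick numerical check:

#include <stdio.h>

int main(void) {
    double expected = 0.0;
    for (int k = 1; k <= 40; k++)                /* sum of k * (1/2)^k       */
        expected += k / (double)(1ULL << k);
    printf("expected levels traversed = %f\n", expected);    /* -> 2.0       */
    return 0;
}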
However, exploiting locality trades fairness for efficiency.
Instead of allowing every active neuron to bid for the
next cycle, or granting service on a strictly first-come–first-served basis, the transmitter acts like a traveling salesman and services the closest customer. It may pick a neuron that is close by over neurons that fired earlier. Giving priority to location minimizes the average cycle time,
thereby maximizing the channel capacity and minimizing
the average wait time. Unfortunately, service is limited to
a local area when the channel is overloaded, and neurons
outside that area are simply ignored. Irrespective of the
fairness of the selection mechanism we choose, the average wait time goes to infinity when the channel capacity
is exceeded. Therefore, maximizing channel capacity is my
paramount concern.
Special attention must be paid to digital–analog interfaces in these hybrid designs. It is imperative to use
positive-feedback to match analog and digital slew rates—
hysteresis alone [41] is not enough. However, the axon-hillock circuit's power dissipation must be reduced drastically (possibly by adapting the bias current to the output's rate of change) and excitatory coupling through the supply rails carefully isolated. Capacitive turn-on and charge pumping in receiver-neuron interfaces were eliminated by
implementing an nMOS-style nand gate.
We may have to adopt bidirectional current-signaling to
achieve subpercent (< −40dB) capacitive crosstalk. However, this technique, which has been used within [55], [24]
and between [56], [57], [58] chips, must be refined further to
reduce static power dissipation, and the area-overhead in
capacitors for frequency-compensation or in large devices
for better matching.
Following a rigorous design methodology for asynchronous logic circuits has paid off, improving robustness and reliability considerably. Although the state-of-the-art implementations are by no means bullet-proof [48], they are getting to the point where nonexperts can use them with the aid of silicon compilation, making it possible for neuromorphic-system designers with limited expertise in asynchronous communication to successfully incorporate these channels into their designs [17], [59].
Acknowledgments
This work began while I was a doctoral student in Carver
Mead’s lab at Caltech, where it was supported by the
ONR; DARPA; the Beckman Foundation; Caltech’s NSF
Engineering Research Center for Neuromorphic Systems;
and the California Trade and Commerce Agency, Office
of Strategic Technology. It is currently funded by start-up funds from Penn's Schools of Engineering and Applied
Sciences and the School of Medicine (through the IME),
by the NSF’s Knowledge and Distributed Intelligence program, and by the Whitaker Foundation.
I thank my thesis advisor, Carver Mead, for sharing his
insights into nervous system organization. I also thank
Misha Mahowald for making available layouts of the arbiter, the address encoders, and the address decoders; John
Lazzaro, Alain Martin, Jose Tierno, and Tor (Bassen)
Lande for helpful discussions; Tobi Delbrück for help with
the Macintosh AER interface and L-Comp; and Jeff Dickson for help with PCB design. I also thank Tanner Research Inc. for making available to me a pre-release version
of their L-Comp libraries for silicon compilation.
References
[1] C A Mead, "Neuromorphic electronic systems," Proc. IEEE, vol. 78, no. 10, pp. 1629–1636, 1990.
[2] E A Vittoz, Analog VLSI Implementation of Neural Networks, Inst. of Physics Publishing & Oxford University Press, 1995.
[3] C A Mead, Analog VLSI and Neural Systems, Addison Wesley, Reading MA, 1989.
[4] M Mahowald, VLSI Analogs of Neuronal Visual Processing: A Synthesis of Form and Function, Ph.D. thesis, California Institute of Technology, Pasadena CA, 1992.
[5] K A Boahen, Retinomorphic Vision Systems: Reverse Engineering the Vertebrate Retina, Ph.D. thesis, California Institute of Technology, Pasadena CA, 1997.
[6] T S Lande, Ed., Neuromorphic Systems Engineering: Neural Networks in Silicon, Kluwer Acad. Pub., Boston MA, 1998.
[7] M Schwartz, Telecommunication Networks: Protocols, Modeling, and Analysis, Addison-Wesley, Reading MA, 1987.
[8] A S Tanenbaum, Computer Networks, Prentice-Hall International, Upper Saddle River NJ, 2nd edition, 1989.
[9] M Sivilotti, Wiring Considerations in Analog VLSI Systems, with Application to Field-Programmable Networks, Ph.D. thesis, California Institute of Technology, Pasadena CA, 1991.
[10] M Mahowald, An Analog VLSI Stereoscopic Vision System, Kluwer Academic Pub., Boston MA, 1994.
[11] T Sejnowski, C Koch, and R Douglas, Eds., Telluride Workshop on Neuromorphic Engineering, http://www.klab.caltech.edu/~harrison/tell97/report97/index.html, 1997.
[12] J Lazzaro, J Wawrzynek, M Mahowald, M Sivilotti, and D Gillespie, "Silicon auditory processors as computer peripherals," IEEE Trans. on Neural Networks, vol. 4, no. 3, pp. 523–528, 1993.
[13] K A Boahen, "Retinomorphic vision systems II: Communication channel design," in Proc. IEEE Intl. Symp. Circ. and Sys., Piscataway NJ, May 1996, IEEE Circ. & Sys. Soc., vol. Supplement, pp. 14–17, IEEE Press.
[14] K A Boahen, "Communicating neuronal ensembles between neuromorphic chips," in Neuromorphic Systems Engineering: Neural Networks in Silicon, T S Lande, Ed., chapter 11, Kluwer Academic Pub., 1998.
[15] S R Deiss, R J Douglas, and A M Whatley, "A pulse-coded communications infrastructure for neuromorphic systems," in Pulsed Neural Networks, W Maass and C M Bishop, Eds., chapter 6, pp. 157–178, MIT Press, Boston MA, 1999.
[16] J P Lazzaro and J Wawrzynek, "A multi-sender asynchronous extension to the address-event protocol," in 16th Conference on Advanced Research in VLSI, W J Dally, J W Poulton, and A T Ishii, Eds., 1995, pp. 158–169.
[17] C M Higgins and C Koch, "Multi-chip motion processing," in 20th Anniversary Conference on Advanced Research in VLSI, D S Wills and S P DeWeerth, Eds., Los Alamitos CA, 1999, IEEE Computer Society Press.
[18] S P DeWeerth, G N Patel, M F Simoni, D E Schimmel, and R L Calabrese, "A VLSI architecture for modeling intersegmental coordination," in 17th Conference on Advanced Research in VLSI, R Brown and A Ishii, Eds., Los Alamitos CA, 1997, pp. 182–200, IEEE Computer Press.
[19] K A Boahen, A G Andreou, T Hinck, J Kramer, and A Whatley, "Computation- and memory-based projective field processors," in Telluride Workshop on Neuromorphic Engineering, T Sejnowski, C Koch, and R Douglas, Eds., 1997, http://www.klab.caltech.edu/~harrison/tell97/report97/index.html.
[20] J G Elias, "Artificial dendritic trees," Neural Computation, vol. 5, pp. 648–663, 1993.
[21] K A Boahen, "Retinomorphic vision systems," in Microneuro'96: Fifth Intl. Conf. Neural Networks and Fuzzy Systems, Los Alamitos CA, Feb 1996, EPFL/CSEM/IEEE, pp. 2–14, IEEE Comp. Soc. Press.
[22] K A Boahen, "A retinomorphic vision system," IEEE Micro, vol. 16, no. 5, pp. 30–39, October 1996.
[23] M Mahowald and C A Mead, "The silicon retina," Scientific American, vol. 264, no. 5, pp. 76–82, 1991.
[24] K A Boahen and A Andreou, "A contrast-sensitive retina with reciprocal synapses," in Advances in Neural Information Processing Systems 4, J E Moody, Ed., San Mateo CA, 1992, vol. 4, pp. 764–772, Morgan Kaufmann.
[25] R F Lyon and C A Mead, "An analog electronic cochlea," IEEE Trans. Acoust., Speech, Signal Processing, vol. 36, pp. 1119–1134, 1988.
[26] L Watts, Cochlear Mechanics: Analysis and Analog VLSI, Ph.D. thesis, California Institute of Technology, Pasadena CA, 1993.
[27] M V Srinivasan, S B Laughlin, and A Dubs, "Predictive coding: A fresh view of inhibition in the retina," Proc. R. Soc. Lond. B Biol. Sci., vol. 216, pp. 427–459, 1982.
[28] J Atick and N Redlich, "What does the retina know about natural scenes," Neural Computation, vol. 4, no. 2, pp. 196–210, 1992.
[29] D K Warland, P Reinagel, and M Meister, "Decoding visual information from a population of retinal ganglion cells," J. Neurophysiol., vol. 78, pp. 2336–2350, 1997.
[30] T Delbruck and C A Mead, "Photoreceptor circuit with wide dynamic range," in Proc. of the Int. Circ. & Sys. Conf., London, England, 1994, IEEE Circuits and Systems Society.
[31] K A Boahen, "Retinomorphic chips that see quadruple images," in Microneuro'99: Seventh Intl. Conf. for Neural, Fuzzy, & Bio-Inspired Systems, Los Alamitos CA, April 1999, Spanish RIG/IEEE, pp. 12–20, IEEE Comp. Soc. Press.
[32] K A Boahen, "The retinomorphic approach: Pixel-parallel adaptive amplification, filtering, and quantization," Analog Integr. Circ. and Sig. Proc., vol. 13, pp. 53–68, 1997.
[33] R Etienne-Cummings, J Van der Spiegel, P Mueller, and M Zhang, "A foveated silicon retina for two-dimensional tracking," IEEE Trans. Circ. & Sys. II, to appear.
[34] D N Mastronarde, "Correlated firing of cat retinal ganglion cells. I. Spontaneously active inputs to X- and Y-cells," J. Neurophysiol., vol. 49, pp. 303–324, 1983.
[35] M Meister, L Lagnado, and D A Baylor, "Concerted signaling by retinal ganglion cells," Science, vol. 270, pp. 1207–1210, 1995.
[36] W M Usrey, J B Reppas, and R C Reid, "Paired-spike interactions and synaptic efficacy of retinal inputs to the thalamus," Nature, vol. 395, no. 6700, pp. 384–387, 1998.
[37] A Murray and L Tarassenko, Analogue Neural VLSI: A Pulse Stream Approach, Chapman and Hall, London, England, 1994.
[38] L M Reyneri, "A performance analysis of pulse stream neural and fuzzy computing systems," IEEE Trans. Circ. & Sys. II, vol. 42, no. 10, pp. 642–40, 1995.
[39] W R Softky, "The highly irregular firing of cortical cells is inconsistent with temporal integration of random EPSPs," J. Neurosci., vol. 13, pp. 334–350, 1993.
[40] W Maass and C M Bishop, Eds., Pulsed Neural Networks, MIT Press, Boston MA, 1999.
[41] A Mortara, E Vittoz, and P Venier, "A communication scheme for analog VLSI perceptive systems," IEEE Trans. Solid-State Circ., vol. 30, no. 6, pp. 660–669, 1995.
[42] L Kleinrock, Queueing Systems, Wiley, New York NY, 1976.
[43] A J Martin, S M Burns, T K Lee, D Borkovic, and P J Hazewindus, "The design of an asynchronous microprocessor," in Advanced Research in VLSI: Proceedings of the Decennial Caltech Conference, C L Seitz, Ed., 1989, pp. 351–373, MIT Press.
[44] I E Sutherland, "Micropipelines," Communications of the ACM, vol. 32, no. 6, pp. 720–738, 1989.
[45] A Martin, "Programming in VLSI: From communicating processes to delay-insensitive circuits," Tech. Rep. CS-TR-89-01, California Institute of Technology, Pasadena CA, 1989.
[46] A J Martin, "Compiling communicating processes into delay-insensitive VLSI circuits," Distributed Computing, vol. 1, no. 4, pp. 226–234, 1990.
[47] A J Martin, "Asynchronous datapaths and the design of an asynchronous adder," Formal Methods in System Design, vol. 1, no. 1, pp. 119–137, 1990.
[48] K A Boahen, "A throughput-on-demand address-event transmitter for neuromorphic chips," in 20th Anniversary Conference on Advanced Research in VLSI, D S Wills and S P DeWeerth, Eds., Los Alamitos CA, 1999, pp. 72–86, IEEE Computer Soc.
[49] J Franca and Y Tsividis, Design of Analog-Digital VLSI Circuits for Telecommunications and Signal Processing, Prentice Hall, New York NY, 1994.
[50] B A Minch, P Hasler, C Diorio, and C A Mead, "A silicon axon," in Advances in Neural Information Processing Systems 7, G Tesauro, D S Touretzky, and T K Leen, Eds., Cambridge MA, 1995, pp. 739–746, MIT Press.
[51] J P Lazzaro, "Low-power silicon spiking neurons and axons," in IEEE Int. Symp. on Circ. & Sys., 1992, pp. 2220–2224, IEEE Press.
[52] G Cauwenberghs and A Yariv, "Fault-tolerant dynamic multilevel storage in analog VLSI," IEEE Transactions on Circuits and Systems II, vol. 41, no. 12, pp. 827–829, 1995.
[53] T Delbruck, H Floberg, and L Peterson, "Address-event communication using Lab-NB board, MatLab, and a Mac," Tech. Rep., California Institute of Technology, Pasadena CA, 1994, http://www.pcmp.caltech.edu/aer/txrx/txrx.pdf.
[54] A Abusland, T S Lande, and M Hovin, "A VLSI communication architecture for stochastically pulse-encoded analog signals," in Proc. IEEE Intl. Symp. Circ. & Sys., Piscataway NJ, May 1996, IEEE Circ. & Sys. Soc., vol. III, pp. 401–404, IEEE Press.
[55] K A Boahen, P O Pouliquen, A G Andreou, and A Pavasovic, "Architectures for associative memories using current-mode MOS circuits," in Proc. of the Decennial Caltech Conf. on VLSI, C L Seitz, Ed., Cambridge MA, 1989, pp. 175–193, MIT Press.
[56] K Lam, L Dennison, and W Dally, "Simultaneous bidirectional signalling for IC systems," in Proc. of the 1990 Conf. on Computer Design (ICCD), Cambridge MA, 1990, pp. 430–433.
[57] L Dennison, W Lee, and W Dally, "High-performance bidirectional signalling in VLSI systems," in Proceedings of 1993 Symposium on Research on Integrated Systems, Cambridge MA, 1993, pp. 300–319, MIT Press.
[58] W J Dally and J W Poulton, Digital Systems Engineering, Cambridge University Press, Cambridge UK, 1998.
[59] G A Indiveri, M Whatley, and J Kramer, "A reconfigurable neuromorphic VLSI multi-chip system applied to visual motion computation," in Microneuro'99: Seventh Intl. Conf. for Neural, Fuzzy, & Bio-Inspired Systems, Los Alamitos CA, 1999, pp. 37–44, IEEE Computer Soc. Press.
Kwabena A. Boahen is an Assistant Professor in the Bioengineering Department at the
University of Pennsylvania, Philadelphia PA,
where he holds a Skirkanich Term Junior Chair
and a secondary appointment in Electrical Engineering. He received a PhD in Computation
and Neural Systems from the California Institute of Technology, Pasadena, CA in 1997,
where he held a Sloan Fellowship for Theoretical Neurobiology. He earned BS and MSE degrees in Electrical and Computer Engineering
from the Johns Hopkins University, Baltimore MD, in the concurrent masters–bachelors program, in 1989, where he made Tau Beta
Kappa. His current research interests include mixed-mode multichip
VLSI models of biological sensory systems, and asynchronous digital
interfaces for interchip connectivity.