ICSE2010 Proc. 2010, Melaka, Malaysia
Qualitative and Quantitative Evaluation of A
Proposed Circuit Switched Network-on-Chip
New Chin-Ee, and Norhayati Soin
Department of Electrical Engineering,
University of Malaya (UM)
50603 Kuala Lumpur, Malaysia
Email: newchinee@perdana.um.edu.my
norhayatisoin@um.edu.my
Abstract- The advancement of the semiconductor industry has led
to a continuously increasing level of integration. Because of
this, and driven by shorter time-to-market and product life
cycles, the industry has migrated to the system-on-chip (SoC)
paradigm. Network-on-chip (NoC) is viewed as a practical solution
for SoC interconnection due to its reusability and scalability.
Existing NoC designs are mainly based on packet switching.
However, a packet switched NoC (PNoC) requires significant
buffering resources, which consume silicon area and power. An
alternative is a circuit switched NoC (CNoC). In this paper, a
circuit switched network protocol and NoC design are proposed and
evaluated both qualitatively and quantitatively. Simulations were
performed to measure and compare the performance of both NoCs to
determine the viability of CNoC as an on-chip interconnection
solution.
I. INTRODUCTION
The semiconductor industry has been driven by accelerated
advancement in process and design technologies. Rapid growth
in the former has enabled a higher level of integration, with
the current transistor feature size shrunk to the 32nm region.
The pursuit of higher integration aims at chips which are
smaller, faster and more energy efficient. At present, industry
players are moving progressively towards the system-on-chip
(SoC) paradigm due to the increase in transistor density as
well as shorter time-to-market and design cycles [1]. An SoC is
an integrated circuit which consists of a number of
heterogeneous blocks, or intellectual property (IP) cores,
manufactured on a single monolithic substrate [2, 11]. SoC
enables design reusability, where IP cores are assembled as
modular components to achieve the overall chip design [3]. An
essential element in reusing IP cores is the exchange of
information between them [4]. To facilitate inter-core data
transfer, standards have been defined to ease integration.
Although the shrinking of transistor feature size offers a
multitude of benefits such as higher transistor density, lower
power consumption, faster clocks and higher yield, it has also
worsened deep sub-micron (DSM) effects such as crosstalk,
capacitive and inductive loads and electromagnetic interference
[5]. Although the power and processing capability of IP cores
scale with the integration trend, the inter-core
interconnections do not, so they may become a performance
bottleneck and a source of significant energy consumption in the
future [6].
978-1-4244-6609-2/10/$26.00 ©2010 IEEE
Currently, network-on-chip (NoC) is a topic of intensive
research as a feasible solution for on-chip interconnection. The
majority of existing research revolves around the design and
optimization of packet switched NoC, whereas little attention
has been given to the alternative, circuit switched NoC. This
work aims to evaluate and compare the benefits, weaknesses and
practicality of both types of NoC, as well as to present the
design of a circuit switched NoC.
II. ON-CHIP INTERCONNECTION
On-chip interconnection has traditionally been based on
dedicated wires and shared buses. Dedicated wires are the
fastest but are not configurable, and become impractical as the
number of cores increases due to wiring congestion and high
manufacturing costs [1]. A shared bus comprises a set of wires
interconnecting, and shared by, a number of cores. It is a
mature technology: much research has been done to optimize
shared bus performance, and many sophisticated bus architectures
have been proposed, such as segmented and hierarchical buses.
Compared to dedicated wires, a shared bus is more flexible and
reusable. Its main drawback is that only one transaction is
allowed at a time [13]. Increasing the number of cores sharing a
bus increases the load capacitance of the bus, which degrades
the bus operating frequency [5-7]. This limits the scalability
of a shared bus to fewer than one or two dozen IP cores [10].
To address the limitations of dedicated wires and shared buses,
on-chip interconnection is moving to the NoC paradigm [12].
Generally, an NoC consists of switches which are interconnected
by communication channels [1]. It is expected to meet three
major communication requirements for SoC, namely reusability,
scalability and parallelism [5-7]. Although several notable NoC
designs have been proposed, much improvement is still needed in
terms of area and power consumption as well as scalability.
III. COMPARISON OF PACKET AND CIRCUIT SWITCHING
There are generally two switching modes in NoC, namely circuit
switching (CNoC) and packet switching (PNoC) [8, 10]. In the
former, a transmission path is established prior to the data
transmission. During the transmission, the entire path is
reserved and cannot be allocated to any other data transmission
[1,8]. The latter does not require a path to be set up prior to
data transmission but instead relies on buffering schemes,
routing strategy and flow control to ensure successful data
transmission. Currently, many proposed NoC architectures are
packet switched and synchronous.
In terms of communication services, packet and circuit
switching typically provide best effort (BE) and guaranteed
transmission (GT) services respectively. For GT service, data
are transmitted with both transmission and timing guarantees,
whereas for BE service only a transmission guarantee is
provided. GT service is generally suitable for real-time
streaming data with a constant data arrival rate. BE service is
generally suitable for systems that involve heterogeneous and
bursty traffic patterns, such as multimedia.
Packet switching typically requires large buffer resources at
network nodes to ease traffic congestion. This buffer
requirement can be reduced via properly designed and selected
routing schemes. Circuit switching does not require any buffer
and thus consumes less silicon area, at the cost of extra
latency for setting up the path before data transmission.
Although path reservation in circuit switching may lead to
higher network contention and lower network usage, these effects
may be reduced via methods such as contention free routing,
static scheduling, virtual channels, virtual circuits and
priorities [8].
At present, packet switching is the predominant switching
method in NoC designs [14-15]. One of the main reasons behind
its dominance is the highly successful and scalable Internet
[10]. However, it is worthwhile to consider circuit switching
due to the following merits:
i. Allows easier implementation of pipelined asynchronous
communication, as data and control signals can be separated
ii. Requires a minimal amount of control, e.g. does not require
arbitration, thus increasing energy efficiency and maximum
throughput
iii. Contention free communication
iv. Does not require any buffering schemes
Circuit switching is highly suitable for systems where the
majority of on-chip traffic requires GT instead of BE service
and is semi-static, meaning that the data streams last for a
relatively long time.
One of the most significant drawbacks of circuit switching is
the blocking of routers due to the reservation of physical
channels. A number of strategies have been proposed to address
this issue, centering on multiplexing the physical channel among
multiple data streams, such as time division multiplexing (TDM)
and lane division multiplexing (LDM). In TDM, different time
slots are allocated to different data streams to utilize the
channel. A pipelined TDM can be achieved by reserving
consecutive time slots in consecutive routers for one data
stream. LDM involves segmenting the bus into narrower sub-buses
(lanes) which can be used by different data streams
simultaneously [8]. Although multiplexing techniques may relieve
router blockages, their implementation requires additional logic
and silicon area.
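As a rough illustration of pipelined TDM, the sketch below
reserves slot s in the first router, slot s+1 in the second, and
so on, wrapping around the frame. The slot table layout, the names
TdmPath and reserve, and the assumption of one hop per time slot
are illustrative only and are not taken from [8] or from the
proposed design.

```cpp
#include <cassert>
#include <vector>

// Hypothetical slot table for one path through the network:
// slots[r][s] holds the id of the stream that owns time slot s on the
// output link of router r, or -1 if that slot is free.
struct TdmPath {
    int numSlots;
    std::vector<std::vector<int>> slots;

    TdmPath(int routers, int slotsPerFrame)
        : numSlots(slotsPerFrame),
          slots(routers, std::vector<int>(slotsPerFrame, -1)) {
        assert(routers > 0 && slotsPerFrame > 0);
    }

    // Try to reserve a pipelined circuit for `stream` that enters the
    // first router in slot `firstSlot`. Router r must be free in slot
    // (firstSlot + r) mod numSlots. Nothing is reserved on failure.
    bool reserve(int stream, int firstSlot) {
        assert(stream >= 0 && firstSlot >= 0);
        const int routers = static_cast<int>(slots.size());
        for (int r = 0; r < routers; ++r)
            if (slots[r][(firstSlot + r) % numSlots] != -1) return false;
        for (int r = 0; r < routers; ++r)
            slots[r][(firstSlot + r) % numSlots] = stream;
        return true;
    }
};
```

Two streams can share the same physical path as long as their
slot sequences do not collide, which is the point of the
multiplexing; the extra slot tables are the additional logic and
area mentioned above.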
In the packet switching domain, packet transfers are usually
performed via store-and-forward, wormhole and virtual
cut-through techniques [8-10]. Store-and-forward involves
storing the entire data packet at a node before transferring it
to the next node. Large buffer resources are needed, resulting
in a costly NoC solution and higher per-node latency. To reduce
the buffers needed, designers usually resort to the latter two
methods, wormhole and virtual cut-through. In the wormhole
method, packets are forwarded in the smallest units of flow
control, called flits, immediately after the header flit has
been examined. Routing is based on the information stored in the
header flit, and the payload and trailer flits follow the same
route [1,8,9]. Similarly, virtual cut-through involves
transmission of data in flits, but the current node waits for a
guarantee from the next node that the entire packet can be
accepted prior to transmission [9].
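The difference between the three techniques can be illustrated
with the usual zero-load estimates. The sketch below assumes no
contention, one flit per link per clock cycle, and a packet of
`flits` flits crossing `hops` links; these assumptions and the
function names are introduced here for illustration and are not
part of the evaluated designs.

```cpp
#include <cassert>

// Zero-load cost of moving one packet, under the stated assumptions.
struct SwitchingCost {
    long latencyCycles;       // cycles until the last flit arrives
    long bufferFlitsPerNode;  // minimum buffering needed at each node
};

// Store-and-forward: every hop receives the whole packet before
// forwarding it, so each hop costs `flits` cycles.
SwitchingCost storeAndForward(long hops, long flits) {
    assert(hops > 0 && flits > 0);
    return {hops * flits, flits};
}

// Wormhole: the header advances one hop per cycle and the remaining
// flits follow in a pipeline; a single flit buffer per node suffices.
SwitchingCost wormhole(long hops, long flits) {
    assert(hops > 0 && flits > 0);
    return {hops + flits - 1, 1};
}

// Virtual cut-through: pipelined like wormhole, but each node must be
// able to accept the entire packet if the header stalls.
SwitchingCost virtualCutThrough(long hops, long flits) {
    assert(hops > 0 && flits > 0);
    return {hops + flits - 1, flits};
}
```

For example, a 16-flit packet over 5 hops takes 80 cycles with
store-and-forward but 20 cycles with wormhole or virtual
cut-through under these assumptions.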
At present, wormhole packet switching is the most prevalent in
proposed NoC designs. However, closer examination shows that a
packet transferred via the wormhole method occupies multiple
links and nodes simultaneously, so a stalled header blocks the
entire path from other data transmissions. This is essentially
the issue faced by circuit switching, and it suggests that a
circuit switching oriented design may be inevitable in order to
meet silicon resource constraints. Although several methods have
been proposed to address the issue of blocked nodes and links
due to the non-guaranteed nature of packet switching, most
require additional resources for buffering or arbitration. A
significant example is virtual channels, where stalled flits are
stored in the output buffer of the NoC node, thus freeing the
node for other data transmissions [9].
IV. PROPOSED CIRCUIT SWITCHING NETWORK PROTOCOL
Considering the merits of circuit switching, the proposed NoC
design implements circuit switching for data transmission and
addresses some of the main issues related to circuit switching.
To establish communication between IP cores, a network protocol
has been defined, which describes the structure of the data
packet as well as the handshaking algorithm before and after
data transmission. The handshaking algorithm can be categorized
into four phases:
i. Circuit setup
ii. Payload data transmission
iii. Circuit teardown
iv. Circuit unavailable
Data transmission is initiated by the source core, which sends
out a special packet called the CRT_SETUP packet. A CRT_SETUP
packet contains the addresses of the target and source cores for
routing purposes, as shown in Figure 1. When an NoC router
receives a CRT_SETUP packet, the packet is routed to the next
router and a connection is set up at the router to link its
incoming and outgoing ports. The packet is then propagated
onward until a circuit is established between the source and
target cores.
Fig 1: Complete transaction
Fig 2: Circuit unavailable - persistent mode
Fig 3: Circuit unavailable - intermittent mode
Upon receiving the CRT_SETUP packet, the target core asserts a
CRT_READY signal back to the source core to indicate that a
circuit has been established and is ready for payload data
transmission. When the source core detects the CRT_READY signal,
it can start the payload data transmission to the target core by
asserting the DATA_VALID signal.
When the payload data has been completely transferred to the
target core, the source core initiates the circuit teardown
phase by de-asserting the DATA_VALID signal. This teardown
condition is propagated to all network nodes in the circuit
until the circuit is entirely torn down and all the network
nodes are released.
In the event that the NoC links are occupied by other
transactions and no alternative routes are available to
establish the circuit, the network node at which the CRT_SETUP
packet is blocked asserts a CRT_BLOCKED signal back to the
source core. When CRT_BLOCKED is detected, the source core can
choose to operate in either intermittent or persistent mode. In
intermittent mode, the source core initiates a circuit teardown
when CRT_BLOCKED is received and retries the circuit setup after
a short duration. In persistent mode, the source core waits
until the blocked node is freed for transmission, without
tearing down the partially completed circuit. The persistent
mode can be used when the data transaction is regarded as being
of higher priority. As can be observed, the source core plays
the most active role in all phases of the protocol.
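The four phases can be summarized as a small state machine for
the source core. The sketch below is one possible reading of the
protocol, not the authors' implementation: the state names, the
Signals structure, the sendRequest flag and the retry timer are
assumptions introduced here, while CRT_READY, CRT_BLOCKED and
DATA_VALID are the signals named in the text.

```cpp
#include <cassert>

// States of a hypothetical source-core controller.
enum class SrcState { Idle, Setup, Transmit, Teardown, WaitRetry };

// How the source core reacts to CRT_BLOCKED.
enum class BlockMode { Intermittent, Persistent };

// Inputs sampled by the source core in one clock cycle.
struct Signals {
    bool sendRequest = false;        // the core has data to send (assumed)
    bool crtReady = false;           // CRT_READY from the target core
    bool crtBlocked = false;         // CRT_BLOCKED from a blocked node
    bool payloadDone = false;        // last payload flit has been sent
    bool retryTimerExpired = false;  // back-off timer (assumed)
};

SrcState nextState(SrcState s, const Signals& in, BlockMode mode) {
    switch (s) {
    case SrcState::Idle:
        // Circuit setup starts by sending a CRT_SETUP packet.
        return in.sendRequest ? SrcState::Setup : SrcState::Idle;
    case SrcState::Setup:
        if (in.crtReady) return SrcState::Transmit;  // assert DATA_VALID
        if (in.crtBlocked && mode == BlockMode::Intermittent)
            return SrcState::WaitRetry;  // tear down the partial circuit
        return SrcState::Setup;  // persistent: hold the partial circuit
    case SrcState::Transmit:
        // De-asserting DATA_VALID after the payload starts the teardown.
        return in.payloadDone ? SrcState::Teardown : SrcState::Transmit;
    case SrcState::Teardown:
        return SrcState::Idle;
    case SrcState::WaitRetry:
        return in.retryTimerExpired ? SrcState::Setup : SrcState::WaitRetry;
    }
    assert(false && "unreachable state");
    return s;
}
```

The only difference between the two modes in this sketch is the
transition taken from Setup when CRT_BLOCKED is seen.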
V. NETWORK TOPOLOGY AND ROUTING METHODOLOGY
Routing schemes can be categorized as either deterministic or
adaptive, and as either source or distributed. In deterministic
routing, the packet route is determined solely by the target and
source core addresses, whereas in adaptive routing, network
traffic is also taken into account in the routing decision. In
source routing, the route is determined at the source and the
entire route information is stored in the header of the packet,
which is then examined by routers to determine the next hop. In
distributed routing, on the other hand, route information need
not be sent, as routing decisions are made at each network
router [1]. Compared to source routing, distributed routing
involves less overhead as it does not require the transmission
of information for the entire route. However, some studies claim
that distributed routing results in more expensive routers, as
routing tables need to be stored at each network router [10].
VI. DESIGN OF NOC ROUTER
Fig 4: Illustration of multi-ring topology with core addresses
(cores labelled 1 to 4)
The proposed circuit switched NoC uses an adaptive and
distributed routing scheme in a multi-ring network topology. In
this topology, each IP core is given an address, and the cores
are connected in a circular manner by a few parallel rings as
shown in Figure 4. Data transmission is initiated by an IP core
sending CRT_SETUP to its interface router, which is the router
connected directly to the IP core. Shortest path routing is done
at the interface router by comparing the source and target
addresses via the simple algorithm:
    If ((target address > source address) and
        (target address < intersection address))
        Route to right port
    Else
        Route to left port
    End if
where the intersection address is defined as the address of the
core opposite the source core in the ring.
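A minimal sketch of this decision is given below. It is an
interpretation rather than the authors' code: it assumes an even
number of cores addressed 0 to numCores-1 with addresses
increasing towards the right port, and it uses modular arithmetic
so that the comparison also holds when the intersection address
wraps around the ring, a case the pseudocode above leaves
implicit. The names Port and routeAtInterface are hypothetical.

```cpp
#include <cassert>

enum class Port { Left, Right };

// Shortest-path port selection at the interface router of `source`.
Port routeAtInterface(int source, int target, int numCores) {
    assert(numCores > 1 && numCores % 2 == 0);
    assert(source >= 0 && source < numCores);
    assert(target >= 0 && target < numCores);
    // Number of hops to the target when leaving through the right port.
    const int rightDistance =
        ((target - source) % numCores + numCores) % numCores;
    // The intersection core lies numCores / 2 hops away in either direction.
    const int intersectionDistance = numCores / 2;
    // Targets strictly between the source and the intersection core are
    // closer via the right port; everything else is routed to the left.
    return (rightDistance > 0 && rightDistance < intersectionDistance)
               ? Port::Right
               : Port::Left;
}
```

With 10 cores, a transfer from core 8 to core 1 is three hops to
the right and is therefore routed to the right port, even though
the target address is numerically smaller than the source address.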
The secondary ring functions as a reserved path for data
transmission when the primary path is blocked by an ongoing data
transmission. A data transmission is routed to the secondary
ring only when the primary ring is blocked. In this topology,
the worst case scenario occurs when a source core transmits data
to the destination core located furthest away, thus occupying
the largest number of network nodes. Since this routing strategy
limits the maximum route length to half of the ring, the maximum
number of simultaneous data transmissions permissible in the
worst case scenario can be calculated by the formula:
    Max. Route = No. of Rings * 2    (1)
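As a worked instance of the formula above, write R for the number
of rings and T_max for the maximum number of simultaneous
worst-case transmissions (both symbols are introduced here for
illustration). Assuming each worst-case route occupies half of
one ring, so that a ring holds at most two such routes, the
double ring topology gives:

```latex
% R     : number of parallel rings (symbol introduced here)
% T_max : maximum number of simultaneous worst-case transmissions
T_{\max} = 2R,
\qquad
R = 2 \;\Rightarrow\; T_{\max} = 2 \times 2 = 4
```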
In CNoC, the network router plays the most significant role as
it is responsible for almost all of the phases in the
handshaking protocol. The roles of the router include routing,
establishing and maintaining circuit connections, and
disconnecting the circuit once transmission is completed.
Compared to PNoC routers, the main difference of CNoC routers is
the lack of buffers, which generally consume a lot of silicon
area. Figure 5 shows the major components of the router, namely
the arbitrator, the ready and block controllers, and the
configuration and block registers.
The arbitrator block implements round robin arbitration, which
provides an equal time slice for each input data port of the
router. Arbitration is only required during the circuit setup
phase and not once the circuit has been established. The
arbitrator block also contains a routing block which performs
the simple routing decision described in Section V. If the
target output port is available, the routing logic sets the
configuration register so that the input data port is connected
to the target output port via the output multiplexer, thus
establishing a circuit for data transmission within the router.
If the target output port is not available, the routing block
hands over responsibility to the block controller, which sets
the block register so that a CRT_BLOCKED signal can be
transmitted back to the source core.
In other words, the network router uses two sets of registers,
the block and configuration registers, to maintain the states
and connections of the router. Both registers are cleared only
when CRT_TEARDOWN is received.
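The interplay of the arbitrator and the two registers can be
sketched as follows. This is an illustrative model only: the
three-port router, the register encoding and the method names are
assumptions, and the actual design in Figure 5 may be organized
differently.

```cpp
#include <array>
#include <cassert>

constexpr int kPorts = 3;   // hypothetical: left, right and local core port
constexpr int kNone = -1;   // marks an unconnected output port

struct Router {
    // Configuration register: config[out] is the input port currently
    // connected to output port `out`, or kNone if the output is free.
    std::array<int, kPorts> config;
    // Block register: blocked[in] is set when the setup request on input
    // port `in` could not be served, so CRT_BLOCKED is sent back on it.
    std::array<bool, kPorts> blocked;
    int lastGrant = kPorts - 1;  // round robin pointer

    Router() {
        config.fill(kNone);
        blocked.fill(false);
    }

    // Round robin arbitration among inputs presenting a CRT_SETUP packet:
    // the search starts just after the most recently granted input.
    int arbitrate(const std::array<bool, kPorts>& setupRequest) {
        for (int i = 1; i <= kPorts; ++i) {
            const int in = (lastGrant + i) % kPorts;
            if (setupRequest[in]) {
                lastGrant = in;
                return in;
            }
        }
        return kNone;
    }

    // Serve a CRT_SETUP from input `in` that was routed to output `out`.
    // Returns true if the circuit segment was established in this router.
    bool setup(int in, int out) {
        assert(in >= 0 && in < kPorts && out >= 0 && out < kPorts);
        if (config[out] != kNone) {
            blocked[in] = true;  // block controller takes over
            return false;
        }
        config[out] = in;        // connect input to output multiplexer
        return true;
    }

    // CRT_TEARDOWN on input `in` clears both registers for that circuit.
    void teardown(int in) {
        assert(in >= 0 && in < kPorts);
        for (int& connected : config)
            if (connected == in) connected = kNone;
        blocked[in] = false;
    }
};
```

Note that no flit storage appears in the model: once config is
set, data simply passes from the input port to the output port.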
The benefits of the proposed routing strategy are:
i. Simpler routing algorithm, allowing distributed routing
without the need to maintain expensive lookup tables at each
router
ii. No potential congestion hotspots arising from the topology
or routing algorithm
iii. Systematic adaptive routing utilizing reserved links
The probability of encountering the worst case scenario can be
minimized by exploiting spatial locality of reference, where
cores which are predicted to communicate frequently are placed
near each other.
Fig 5: Components of network router
VII. METHODOLOGY
In order to evaluate and compare the performance of the
proposed CNoC with PNoC, simulators for each NoC were developed,
which are capable of providing accurate functional and clock
count based timing analysis. The PNoC design used in the
simulation is output buffered and based on wormhole routing.
Instead of performing simulations based on derived formulae, the
developed simulators emulate the real hardware via concurrent
simulation of the entire network. The simulators were developed
in C++ and employ object oriented designs to model the network
nodes. In the simulations, both NoCs consist of 10 transmitters
and 10 receivers interconnected by 20 router nodes in a double
ring topology as shown in Figure 4. Since both the CNoC and PNoC
designs evaluated in this study are synchronous, the networks
are modelled with two basic operations for each clock cycle,
namely synchronous and combinational operations, as shown in
Figure 6. The NoC designs are optimized so that each flit is
forwarded to the next node in a single clock cycle without
degrading the NoCs' maximum operating frequencies.
Fig 6: Modelling of synchronous circuits in NoC simulators based
on clock cycles
VIII. RESULTS AND DISCUSSIONS
Figure 7 compares the transmission characteristics of CNoC and
PNoC for random network traffic. The graph shows a slight
increase in transmission time for both networks as the number of
simultaneous transmissions increases, that is, as the network
becomes more congested. The average transmission time for CNoC
is consistently a little higher than for PNoC. Analysis of the
networks found this delay to be due to the circuit setup time of
CNoC prior to payload transmission. However, with proper
optimization and network localization, the circuit setup time
can be minimized, improving the timing performance of CNoC. The
low coupling between transmission time and the number of
simultaneous transmissions also indicates scalability for both
networks, as more network nodes and cores can be added without
affecting the transmission time significantly.
Fig 7: Comparison of total transmission time for PNoC and CNoC
with increasing network congestion
(y-axis: average clock count; x-axis: no. of simultaneous
transmissions; series: packet switched NoC, circuit switched NoC)
Figure 8 shows the effect of increasing the output buffer size
of each router on the timing performance of PNoC. The graph
shows that increasing the buffer size does not improve the
timing performance at low network utilization. When network
congestion increases, increasing the buffer size generally
improves the timing performance of the network, as fewer links
and nodes are occupied by a single transmission at any one time.
The graph also shows that PNoC suffers from a low
performance-cost ratio in terms of buffering resources, and that
increasing buffering resources is not a proper optimization
strategy for PNoC due to the fast saturation in performance with
the increase in network congestion.
Fig 8: Comparison of the effect of increase in buffer size (% of
packet length) to total transmission time
(y-axis: average clock count; x-axis: no. of simultaneous
transmissions; series: buffer sizes of 20%, 40%, 60%, 80% and
100%)
Modelling and simulation of NoC designs enable an accurate and
quick evaluation of the performance of large NoCs without
requiring expensive hardware resources. It is also possible to
simulate heavy network loads to test for the possibility of
deadlocks in the NoCs.
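The clock-cycle-based modelling used by the simulators, with a
combinational and a synchronous operation in each cycle, can be
illustrated with a minimal two-phase update. The sketch below is
not the authors' simulator: the Node structure, the use of 0 for
"no flit" and the simple chain of nodes are assumptions chosen to
keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model of one network node: a register holding one flit,
// where 0 means that no flit is present.
struct Node {
    int reg = 0;   // value visible to the rest of the network this cycle
    int next = 0;  // value computed by the combinational phase
};

// Advance a chain of nodes by one clock cycle.
void clockCycle(std::vector<Node>& chain, int inputFlit) {
    assert(!chain.empty());
    // Combinational phase: every node computes its next value from the
    // current register contents only, so evaluation order does not matter.
    for (std::size_t i = 0; i < chain.size(); ++i)
        chain[i].next = (i == 0) ? inputFlit : chain[i - 1].reg;
    // Synchronous phase: all registers are updated together, as they
    // would be on a clock edge.
    for (Node& n : chain) n.reg = n.next;
}
```

Separating the two phases lets a sequential C++ program mimic
hardware in which all nodes operate concurrently, and it
reproduces the property that a flit advances by exactly one node
per clock cycle.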
As can be seen from Figure 9, the flit transmission time for
CNoC shows a dip at the center of the graph, indicating that
CNoC performs best when the packet size is between 700 and 1300
flits. For smaller packet sizes, the ratio of setup time to
payload transmission time is higher, resulting in lower
transmission efficiency. Packet sizes which are too big result
in longer network congestion and lower network utilization.
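The small-packet effect can be seen with simple arithmetic. The
sketch below assumes a fixed circuit setup cost followed by one
clock cycle per payload flit; the function name and any setup
value passed to it are hypothetical, and the model deliberately
ignores the congestion effect that penalizes very large packets.

```cpp
#include <cassert>

// Average number of clock cycles spent per payload flit when a fixed
// setup cost is amortized over the whole packet.
double perFlitCycles(double setupCycles, double payloadFlits) {
    assert(setupCycles >= 0.0 && payloadFlits > 0.0);
    return (setupCycles + payloadFlits) / payloadFlits;
}
```

With a setup cost of 20 cycles, a 20-flit packet costs 2 cycles
per flit, and the per-flit cost falls towards 1 cycle as the
packet grows, which is consistent with the left-hand side of the
curve described above.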
Figure 10 shows the timing performance of CNoC under simulated
network congestion. Network congestion is simulated by setting
all transmitters to transmit to a single receiver
simultaneously. The graph shows that the transmission time for
CNoC increases as the packet length and the number of
simultaneous transmissions increase.
Fig 9: Per-flit transmission time for different packet length
(y-axis: flit transmission time; x-axis: packet length in clock
counts)
Fig 10: Transmission characteristic for simulated congestion in
CNoC
(x-axis: packet length)
Using the simulators, a heavy network load was simulated by
transmitting a total of 1.6 MB of data from 10 simultaneous
transmitters to a single destination core. Under this heavy
traffic load, it was found that the proposed CNoC does not
experience any deadlock and is able to recover completely after
the traffic burst.
IX. CONCLUSION
In this work, the network protocol and design of a CNoC have
been proposed, and qualitative as well as quantitative
evaluations of the proposed network have been presented. From
the qualitative analysis of available PNoC designs, it was found
that efforts to address one of the most significant issues in
PNoC designs, namely limited buffering resources, led to the
implementation of wormhole routing in most PNoC designs.
However, wormhole routing suffers from a similar drawback to
CNoC, namely the reservation of multiple links and nodes during
a transmission. Quantitatively, simulations showed that PNoC
suffers from a low performance to cost ratio in terms of
buffering resources. Simulations also showed that PNoC and CNoC
exhibit similar transmission characteristics for low and heavy
traffic loads, except that the timing performance of CNoC is
generally slightly lower than that of PNoC. The added latency is
mainly due to circuit setup prior to transmission and can be
minimized via network localization. Due to its advantages of not
requiring buffering resources, lower area and power consumption,
better scalability and high performance, CNoC may be the better
alternative solution for on-chip interconnects. Further research
and optimization will be required to improve the performance of
CNoC and will be presented in future works.
REFERENCES
[1] Moraes F., Mello A., Moller L., Ost L. & Calazans N., “A Low
Area Overhead Packet-Switched Network on Chip: Architecture and
Prototyping”, pp. 1-6.
[2] Gupta R.K. & Zorian Y., 1997, “Introducing Core-Based System
Design”, IEEE Design and Test of Computers, vol. 4, pp. 1-5.
[3] Saastamoinen I., Alho M. & Nurmi J., 2003, “Buffer
Implementation for Proteo Network-on-Chip”, IEEE, pp. II-113 –
II-116.
[4] Bartic T.A., Mignolet J.Y., Nollet V., Marescaux T., Verkest
D., Vernalde S. & Lauwereins R., 2003, “Highly Scalable Network
on Chip for Reconfigurable Systems”, IEEE, pp. 1-8.
[5] Tortosa D.S. & Nurmi J., “Proteo: A New Approach to
Network-on-Chip”, pp. 1-5.
[6] Adriahantenaina A., Charlery H., Greiner A., Mortiez L. &
Zeferino C.A., 2003, “SPIN: A Scalable, Packet Switched, On-Chip
Micro Network”, IEEE, Proc. of the Design, Auto. & Test in
Europe Conf. & Exh., pp. 14.
[7] Zeferino C.A. & Susin A.A., 2003, “SoCIN: A Parametric and
Scalable Network-on-Chip”, IEEE, Proc. of the 16th Symp. on Int.
Circuits & Systems Design, pp. 1-6.
[8] Wolkotte P.T., Smit G.J.M., Rauwerda G.K. & Smit L.T., 2005,
“An Energy Efficient Reconfigurable Circuit Switched
Network-on-Chip”, IEEE, Proc. of the 19th IEEE Int. Parallel and
Distributed Processing Symp., pp. 1-8.
[9] Bjerregaard T. & Mahadevan S., “A Survey of Research and
Practices of Network-on-Chip”, ACM Computing Surveys, p. 33.
[10] Dielissen J., Radulescu A., Goossens K. & Rijpkema E.,
“Concepts and Implementation of Philips Network-on-Chip”,
pp. 1-6.
[11] Ali M., Welzl M. & Zwicknagl M., 2008, “Networks on Chips:
Scalable Interconnects for Future System on Chips”, IEEE,
pp. 240-245.
[12] Zeferino C.A., Kreutz M.E., Carro L. & Susin A.A., 2002, “A
Study on Communication Issues for Systems-on-Chip”, IEEE, Proc.
of the 15th Symp. on IC and Systems Design (SBCCI’02), pp. 1-6.
[13] Henkel J., Wolf W. & Chakradhar S., 2004, “On-chip
networks: A Scalable, Communication-centric Embedded System
Design Paradigm”, IEEE, Proc. of the 17th Int. Conf. on VLSI
Design, pp.
[14] Moraes F., Mello A., Moller L., Ost L. & Calazans N., “A
Low Area Overhead Packet-Switched Network on Chip: Architecture
and Prototyping”, pp. 1-6.
[15] Zeferino C.A. & Susin A.A., 2003, IEEE, Proc. of the 16th
Symp. on IC and Systems Design (SBCCI’03), pp. 1-6.