Network Processor specific Multithreading tradeoffs Victor Boivie Reg nr: LiTH-ISY-EX--05/3687--SE

Master's thesis carried out in Computer Engineering at Linköpings Tekniska Högskola
Supervisors: Andreas Ehliar and Ulf Nordqvist
Examiner: Dake Liu
Linköping, 2005.
Department: Institutionen för systemteknik, 581 81 Linköping
ISRN: LITH-ISY-EX--05/3687--SE
Multithreading is a processor technique that can effectively hide the long latencies caused by memory accesses, coprocessor operations and the like. While this looks promising, there is an additional hardware cost that varies with, for example, the number of contexts to switch between and the switching technique used, and this cost might limit the possible gain of multithreading.
Network processors are traditionally multiprocessor systems that share many common resources, such as memories and coprocessors, so the potential gain of multithreading could be high for these applications. On the other hand, the relative hardware cost of multithreading will be high since the rest of the processor is fairly small. Instead of making a processor multithreaded, higher performance might be achieved by simply using more processors.
As a solution, a simulator was built in which a system can be modelled efficiently and where the simulation results can give hints of the optimal solution for a network processor system in its early design phase. A theoretical background to multithreading, network processors and more is also provided in the thesis.
Keywords: multithreading, network processors, computer architecture, system level design exploration
First I would like to thank my supervisors, Andreas Ehliar at the university, and Ulf Nordqvist
at Infineon Technologies, for the opportunity to work with this interesting and challenging project
and for guiding me throughout the project.
I also feel gratitude towards the people I worked with at Infineon Technologies in Munich.
Thank you Xiaoning Nie, Jinan Lin and Benedikt Geukes for all your help.
I would also like to thank my examiner, Dake Liu, professor at the Computer Engineering
Division, for offering me the opportunity to work on this project.
My opponent, David Bäckström, should also be mentioned here for giving me valuable feedback
which helped me improve the thesis.
Last, but absolutely not least, I would like to thank all of my friends for being there all the
time. We really had a lot of good times together.
Context – All the information associated with a processor thread; for example all registers, the program counter, flags and more.
Thread – A lightweight process.
ILP – Instruction level parallelism.
TLP – Thread level parallelism.
CAM – Content Addressable Memory; an associative memory.
TCAM – Ternary CAM.
GPP – General-purpose processor.
ASIC – Application Specific Integrated Circuit; dedicated hardware for a specific function or algorithm.
ASIP – Application Specific Instruction-Set Processor; a processor whose instruction set has been adapted to a certain application.
NP – Network Processor; a programmable hardware device designed to process packets at high speed. Network processors can perform protocol processing (PP) quickly.
PP – Protocol processor; a processor specialised for protocol processing.
Protocol – A set of message formats and the rules that must be followed to exchange those messages.
OSI – The interconnection of open systems in accordance with standards of the International Organization for Standardization (ISO) for the exchange of information.
NIC – Network Interface Controller.
Ingress – Traffic which comes from the network into the network controller; incoming traffic.
Egress – Traffic which comes from the network controller and is destined for the network; outgoing traffic.
IP – Internet Protocol, the network layer protocol of the TCP/IP suite; a connectionless best-effort packet-switching protocol.
TTL – Time To Live, a field in the IP header; defines how many router hops a packet can make before it is discarded.
MAC – Media Access Control.
ATM – Asynchronous Transfer Mode; a high speed network protocol which uses 53 byte cells.
AAL5 – ATM Adaptation Layer Five; used predominantly for the transfer of classic IP over ATM.
WAN – Wide Area Network.
LAN – Local Area Network.
HPA – Header Processing Applications.
PPA – Payload Processing Applications.
ISA – Instruction Set Architecture.
NAT – Network Address Translation.
PP32 – Infineon's 32-bit Packet Processor.
Contents

1 Introduction
Background
Objectives
Methods
Time

2 Background
Computer Networks
Network Processors
Multithreading
System Level Methodology
Area Efficiency

3 Previous Work
Spade
StepNP

4 Proposed Solution - EASY
Difference from Previous Work
System Modelling
Memories
Coprocessors
Processors

5 Simulation 1 - assemc
Introduction
Application modelling
Architecture modelling
The simulation goal
Parameters
The simulation
Results
Conclusions

6 Simulation 2 - NAT
Introduction
Network address translation
Partitioning
Cost Functions
The goal
Parameters
Possible settings and questions
Packet lifetime
Simulation results

7 Conclusions

8 Future Work
Introduction
Suggested features

A Modeling Language Microinstructions
A.1 Executing commands
A.2 Statistics instructions
A.3 Program flow instructions
A.4 Interrupt instructions
A.5 Switch instructions
A.6 Semaphore instructions
A.7 Message passing instructions

B Architecture File
B.1 Introduction
B.2 Semaphore declarations
B.3 Memory declarations
B.4 Coprocessor declarations
B.5 Processor declarations
B.6 Dump statement
B.7 Queue statement
B.8 Penalty statement

C Simulator Features

D NAT - Flow Chart
Chapter 1
Introduction
Background
Network processor systems are traditionally multiprocessor systems that share some common resources, such as memories, coprocessors and more. When a processor wants to use a resource that is currently processing a previous request, the processor has to wait until the resource is free before it can issue its request. While it is waiting, it will normally not do anything useful, but if the processor supports multithreading, it can start working on something else and later, when the resource is free, continue where it stopped.
This processor feature is called multithreading and can be implemented in many different ways depending on how many threads the processor can work on at the same time (though not necessarily simultaneously), the length of the delay required to switch from one thread to another and more. The different techniques differ a lot in hardware cost, which occupies expensive area on the chip, so finding an optimal technique for a given system can lead to high performance numbers while still keeping the area low.
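The trade-off can be made concrete with a simple textbook saturation model (this is an illustrative sketch with made-up numbers, not a formula or data taken from the thesis): each thread computes for a while, then blocks on a long-latency resource, and switching threads costs a few cycles.

```python
def utilisation(n_threads, run, latency, switch_cost):
    """Estimate processor utilisation for a blocked-multithreading model.

    Each thread computes for `run` cycles, then blocks for `latency`
    cycles; switching to another thread costs `switch_cost` cycles.
    Below saturation, utilisation grows linearly with the number of
    threads; above it, the switch overhead sets the ceiling.
    """
    linear = n_threads * run / (run + latency)   # latency not yet fully hidden
    cap = run / (run + switch_cost)              # all latency hidden
    return min(linear, cap)

# One thread hides nothing; six threads saturate the processor, but a
# 4-cycle context switch penalty caps utilisation below 100%.
print(utilisation(1, run=20, latency=100, switch_cost=4))  # ≈ 0.167
print(utilisation(6, run=20, latency=100, switch_cost=4))  # ≈ 0.833
```

The model shows why adding contexts has diminishing returns: beyond the saturation point, extra contexts only add area, which is exactly the tension the thesis explores.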
Figure 1.1: Different architectural choices for a fixed chip area, e.g. three singlethreaded processors versus two multithreaded processors with three contexts each.
For a network processor system, the processors traditionally occupy a small chip area, so introducing multithreading is relatively expensive. Instead of having one processor with a high number of threads, the same area can be used to deploy two singlethreaded processors, or similar, as shown in figure 1.1. Finding the optimal solution by hand will be very difficult, or even impossible, for a large system.
Processor design decisions must be made early in the design phase to cut down the amount of
time spent on development, so a methodology must be used to get early numbers that, even though
they might be uncertain and approximate, give a hint on how the next step in the development
should be taken. In this stage, there are no instruction set architecture simulators nor definitive
target applications.
A solution to this problem is the system level design exploration simulator for multithreaded network processor systems that was created for Infineon Technologies by the thesis author. It allows an architecture and an application to be shallowly described in a short amount of time, but still generates important information on how the system behaves with respect to multithreading properties. The architecture and application can then be changed or refined to find an optimal solution.
Objectives
The main objective of this thesis is to find the optimal multithreading solution for a given network processor system and application, and to be able to easily change the architecture or application to reflect the uncertainty of the early design phase.
• To design a simulator that can generate performance numbers suitable for decision making early in the design phase.
• To simulate a benchmark from Infineon Technologies using the proposed simulator and extract interesting results.
• To design a larger benchmark to see if it is possible to model a complex network processor system with many processors and shared resources.
Methods
First, a theoretical background in network processors, processor architecture and specifically multithreading aspects was collected from various papers, literature and other publications. When this had been thoroughly studied, different benchmarks and other documentation were studied at Infineon to get a good overview of how their typical network systems were designed. This was later used to design and implement a high level simulator, which was written in C++.
Time
This thesis work was performed during a period of 20 weeks in accordance with the requirements at Linköping University.
Chapter 2
Background
Computer Networks
This is only a short introduction to explain some of the network terms that will be used throughout
this report.
Layered communication
The end-to-end communication over a network can be broken down into several layers. Each layer only passes information up to the layer above it or down to the layer below it, and from a layer's point of view, it can communicate directly with the corresponding layer at the other endpoint, since the other layers make the rest of the communication transparent.
Layer models
The Open Systems Interconnection (OSI) reference model divides communication into seven distinct layers, seen in figure 2.1. By far the most common model on the Internet is the TCP/IP model, which divides the communication into four layers:
Application layer
This layer corresponds to the application that the user is running and that tries to communicate. This could, for example, be a web browser (using the HTTP protocol) or an e-mail client (using the POP/SMTP protocols).
Figure 2.1: Layer models and protocol examples.
Transport layer
Here, TCP and UDP are the most common protocols. TCP creates a reliable stream channel between two endpoints while UDP creates an unreliable packet-based channel. Most communication on the Internet is packet based, and TCP must convert the unreliable packet-based communication into reliable streams.
Network layer
Almost all communication over the Internet is done using the IP protocol, which can be found at this layer. It is a packet-based unreliable best-effort service. All hosts on the Internet (with the exception of hosts behind a NAT server) can be reached by a globally unique address, called an IP address. IP packets cannot be infinitely large, so when the transport layer wants to send something that is larger than the maximum transfer unit (MTU), IP will break apart (fragment) the information into smaller packets, which are sent individually and later rebuilt (reassembled) at the receiving side.
Link layer
This layer represents the network card, the physical wires and such, and its main task is to transfer the actual data to the receiver. This might include splitting the data into smaller portions, called frames, and adding error checking and the like.
The layers do not have to know much about the layers below or above them. The network layer only has to know that it must give the link layer an IP packet, and that an IP packet will be delivered at the receiving side. The same goes for the rest of the layers. However, a protocol at the application layer can, for example, decide whether it wants to use TCP or UDP, which are in a layer below it.
Network Processors
The Internet has significantly influenced the way we communicate and build our infrastructure, and new applications that make use of the Internet in new ways are invented every day. Its rapid growth has made it possible for a lot of users to benefit from it all the time. These are two reasons why it is difficult to develop good network system architectures: the requirements change often and the speed requirements increase rapidly. Like early computer designers, the builders of network processor architectures have experimented a lot with architectural aspects, such as functional units, interconnection and other strategies, since there is a lack of experience and no proven correct solution. Due to this, there is no standard architecture, as there is for most general purpose computing.
The First Network Processing Systems
In the 1960s and 1970s, the first network systems were introduced, and during this time, the CPU of a standard computer was fast enough to handle the relatively slow speeds that the networks operated at. In the following years, CPU speeds increased at a higher rate than network speeds, and a small computer could take care of more and more complicated tasks, such as IP forwarding.
Increasing Network Speeds
Nowadays, network speeds are increasing much faster than processor performance. Only a few years ago, 100Base-T replaced the old 10Base-T networks at companies, and now many are moving to 1000Base-T.
Table 2.1: Network speeds and the time available for small and large packets (in µs).
As seen in table 2.1 (and with some math), a single processor running at 1 GHz, which is considered to be a high frequency, only has time to execute around 500 cycles for every small packet when working with 1000Base-T. This is a very small number and insufficient for most processing. A router with 16 interfaces at this speed can only spend 31 cycles on each packet, which is far from enough. In other words, the increasing network speeds are a big problem for network processing.
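The cycle budget can be reproduced with a back-of-the-envelope calculation. This sketch assumes the 64-byte minimum Ethernet frame for "small" packets and ignores preamble and inter-frame gap, which is why it lands slightly above the figures quoted above:

```python
LINK_RATE = 1e9     # 1000Base-T: 1 Gbit/s
CLOCK = 1e9         # processor clock: 1 GHz
FRAME_BYTES = 64    # minimum Ethernet frame size

# Wire time of one small frame (preamble and inter-frame gap ignored).
frame_time = FRAME_BYTES * 8 / LINK_RATE       # ≈ 512 ns

# Cycles available per packet on a single 1 GHz processor ...
cycles_per_packet = frame_time * CLOCK         # ≈ 512, "around 500"

# ... and per packet when 16 such interfaces share one processor.
cycles_per_packet_16 = cycles_per_packet / 16  # ≈ 32, "around 31"

print(cycles_per_packet, cycles_per_packet_16)
```

Counting the 8-byte preamble and 12-byte inter-frame gap as well would give 672 ns per frame, so the budget stays in the same few-hundred-cycle range either way.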
Different Architectures
Since single general-purpose CPUs are insufficient for processing packets at a high rate, many alternative architectural designs have been explored[2] to overcome the performance problem. These techniques have previously proven to increase performance for certain types of computationally heavy processing.
• Fine-grained parallelism (Instruction level parallelism)
• Symmetric coarse-grained parallelism (Symmetric multiprocessors)
• Asymmetric coarse-grained parallelism
• Special-purpose coprocessors (ASIC)
• NICs with onboard stacks
• Pipelined system
All these solutions have their advantages and disadvantages, and a trade-off has to be made when choosing any of them, unless a completely different solution is selected instead. This is due to the fact that network processing is a collective name for so many different tasks, and their differences make it impossible to find a general ideal solution.
Fine-grained Parallelism (Multi-issue)
This is a widely used technique for high performance systems. The idea is that a processor performing operations in parallel should be able to process much more data at a time. In a normal program, some instructions can be run at the same time, and the detection of parallelism can either be performed at compile time (as for VLIW processors) or on the fly while the program is executing, as for superscalar processors.
It has been shown that for network systems, instruction-level parallelism does not achieve
significantly higher performance[16] and the architectural costs are high.
Symmetric Coarse-grained Parallelism (GPP)
Instead of exploiting instruction-level parallelism (instructions that can be run in parallel), thread level parallelism (TLP) can be exploited. This requires that the program is structured in such a way that portions of the code (threads) can be run independently of each other. TLP makes it possible to run an application on multiple processors at the same time, and this has a few advantages, one of them being that no modification has to be performed on the processor: normal small processors can be used.
It will also be fairly easy to scale up the system (just increase the number of processors), but unfortunately the performance will not scale as easily. One major bottleneck is that the processors often are connected to the same memory, which they must share. The same is also true for other shared resources such as buffers, coprocessors and more. In the worst case, in a system of N processors, a processor might have to wait for as many as N − 1 other processors before it can communicate with a shared resource.
Since normal general purpose processors are used, the same amount of processing per packet
is still required, but since many more packets can be worked on simultaneously, the overall performance will increase.
Special-purpose Coprocessors (GPP+ASIC)
Another solution is to have a general-purpose processor together with a special-purpose coprocessor (ASIC) that can perform some special operations very fast. The coprocessors are very
simple in design (they do not have to do anything else except what they are built for) and can be
controlled by the general purpose processor. If the operations that account for most of the processing of a packet are performed by the ASIC, the performance on the whole system will improve
It is also possible to have an ASIP (Application Specific Instruction-set Processor) together
with coprocessors and this can have even higher performance gains, but will become slightly less
Application Specific Instruction-set Processors (ASIP)
Another possibility to increase performance is to have specialised processors. Each processor is very good at performing its task, for example IP fragmentation, while another takes care of another layer or similar. This can increase the performance a lot, but some drawbacks are that such processors are more difficult to program than a general processor, and each is only good at what it was intended for; if the task changes, the performance can degrade significantly. Another big disadvantage is that they are expensive to design and build.
A trade-off can be made in how far the processor is specialised. There are some common tasks for all protocol processing applications[11], such as bit and port operations, so if these are optimised, the performance of many applications, even the ones not yet known, will increase, and if these optimisations are not as extreme, the processor will also be more flexible.
Pure ASIC Implementation
For ultimate performance, but also at the cost of the least flexibility, specialised hardware can be used directly. Designing an ASIC is expensive, and since it cannot do anything other than what it was designed for, its use is limited in most applications.
Pipelined Architecture
Instead of processing an entire packet, or a certain layer of a packet, on one processor, the processing can be broken into smaller operations that are performed sequentially by different processing elements. For example, when an IP packet is forwarded, first the checksum is verified, then the TTL (Time To Live) field is decremented, the destination address is looked up in a table and so on.
It would be possible for one processing element to verify the checksum and then pass the packet to the next processing element, which decrements the TTL, while at the same time a new packet is fetched by the checksum verification processor. This is repeated for all remaining operations that have to be done for a packet. Even though an individual packet takes as long to process as on a simple processor, more packets are processed simultaneously, thus increasing the total throughput of the system. The hardware requirement for each stage is low since it only has to do a small task, but balancing the stages is difficult (the chain will not be faster than its slowest link), and the system will be difficult to program.
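The throughput argument, including the stage-balancing caveat, can be illustrated with a toy cycle count (the per-stage latencies are hypothetical, not measurements from the thesis):

```python
# Hypothetical per-stage latencies, in cycles, for the forwarding
# example: verify checksum, decrement TTL, look up destination.
stages = [12, 2, 30]

# A single processor performs all stages for each packet in sequence.
single_cycles_per_packet = sum(stages)       # 44 cycles per packet

# A full pipeline emits one packet per "beat"; the beat is set by the
# slowest stage, so the unbalanced lookup stage dominates throughput.
pipelined_cycles_per_packet = max(stages)    # 30 cycles per packet

print(single_cycles_per_packet / pipelined_cycles_per_packet)
```

With perfectly balanced stages the speedup would approach the number of stages (three here); the gap between 44/30 and 3 is exactly the balancing problem mentioned above.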
Packet Processing Tasks
To be able to understand how the hardware architecture for network processors should be built, it
is essential to know what a packet processor actually does.
A network processor can work in very different areas and at different layers (see page 5), and because of the diversity among these tasks, it is very difficult to categorise them, to compare different tasks and to understand why a specific architecture was chosen. A network processor that mainly works on the link layer (Ethernet frames, for example) but also processes packets at higher layers usually has its architecture optimised for the low layer processing, even though it performs a lot of high layer functions.
The functionality of network processing is often divided into these categories:
• Address lookup and packet forwarding
• Error detection and correction
• Fragmentation, segmentation and reassembly
• Frame and protocol demultiplexing
• Packet classification
• Queuing and packet discard
• Security: authentication and privacy
• Traffic measurement and policing
• Traffic shaping
Address lookup and packet forwarding
Address lookup is frequent in several layers. In Ethernet switching, the MAC address is looked up in a table to know where to forward the frame. At the IP level, the IP address is looked up during IP routing to know where to forward the packet, and so on. There are many more cases, and in all of them, the system maintains a table and performs the lookup in it. This lookup can be more or less advanced, and the complexity differs a lot. While Ethernet switching lookups are fairly easy (they look for an exact match of a MAC address in a fairly small table), IP lookups can be more advanced, requiring a partial match (longest prefix matching) in a table of up to 80000 entries. In high performance solutions, dedicated hardware for table lookups and maintenance is required, and content addressable memories (CAMs) are common.
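Longest prefix matching can be sketched in software as a linear scan over a small, made-up routing table; real routers use tries or TCAMs to make this fast enough:

```python
def ip_to_int(addr):
    """Convert a dotted-quad IPv4 address to a 32-bit integer."""
    a, b, c, d = (int(x) for x in addr.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d

def lookup(table, addr):
    """Return the next hop of the longest matching prefix, or None."""
    ip = ip_to_int(addr)
    best_len, best_hop = -1, None
    for prefix, plen, next_hop in table:
        # A /plen prefix matches when the top plen bits agree.
        if plen > best_len and (ip >> (32 - plen)) == (ip_to_int(prefix) >> (32 - plen)):
            best_len, best_hop = plen, next_hop
    return best_hop

# Hypothetical routing table: (prefix, prefix length, next hop).
routes = [("0.0.0.0", 0, "default"), ("10.0.0.0", 8, "A"), ("10.1.0.0", 16, "B")]

print(lookup(routes, "10.1.2.3"))   # "B" (the /16 wins over the /8)
print(lookup(routes, "10.2.3.4"))   # "A"
print(lookup(routes, "192.0.2.1"))  # "default"
```

The "partial match" nature is visible in the code: several prefixes of different lengths can match the same address, and the longest one must win, which is what makes a plain exact-match CAM insufficient.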
Error detection and correction
Error detection is a very common and heavily analysed feature of protocol processing. The need
for error detection is essential – bit errors often occur making a packet corrupt. This can be due
to signalling problems when transferring the packet, or due to faulty hardware or software. Error
detection is present in many protocols using e.g. checksums and the computational power required
to verify or calculate a checksum can be large compared to the rest of the processing of that layer
(e.g. CRC in Ethernet).
In most network system solutions, dedicated hardware for calculating the Ethernet CRC checksum is present, since it must be calculated for every incoming and outgoing Ethernet frame. For higher level protocols, checksum calculation is often performed in software.
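As an illustration of such software checksumming, the 16-bit ones' complement checksum used by IP, TCP and UDP (RFC 1071) fits in a few lines; the sample header below is a well-known worked example, not data from the thesis:

```python
def internet_checksum(data):
    """RFC 1071: ones' complement sum of 16-bit big-endian words."""
    if len(data) % 2:              # pad odd-length data with a zero byte
        data += b"\x00"
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:             # fold the carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

# A 20-byte IPv4 header with its checksum field zeroed out.
header = bytes.fromhex("4500 0073 0000 4000 4011 0000 c0a8 0001 c0a8 00c7")
print(hex(internet_checksum(header)))  # 0xb861
```

A receiver verifies a header by checksumming it with the checksum field left in place; the result is then zero, which makes per-packet verification a simple loop plus a comparison.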
Error correction is not very common in today’s systems since the layered OSI model puts most
responsibility to the lower layers (error correction would have to be implemented there), and the
additional data and processing required for correcting bit errors is not insignificant.
Fragmentation, segmentation and reassembly
Fragmentation is what IP does to split up large higher layer packets into smaller chunks that fit inside an IP packet, and segmentation is the corresponding operation that splits large AAL5 packets into ATM cells. Fragmentation and segmentation are fairly straightforward, while reassembly can be more complex. The length of the full packet is not known in advance, the packets can be delayed and lost, and they can even arrive in the wrong order. This makes reassembly both computationally complex and demanding of extra resources, such as large memories and timers.
Frame and protocol demultiplexing
Demultiplexing is an important concept in the OSI layer model. For example, when a frame arrives, the frame type is used to demultiplex the frame, i.e. to see to which upper layer protocol it should be passed, for example IP or ARP. This is used throughout the layers.
Packet classification
Classification means mapping a certain packet to a flow or category, which is a very broad concept. For example, a flow can be defined as:
• A frame containing an IP datagram that carries a TCP segment
• A frame containing an IP datagram that carries a UDP datagram
• A frame that contains anything other than the above
These flows are static and decided before any packets arrive. It is also possible to have dynamic flows that are created during the processing; an example would be to map a certain IP source address to a flow for extra treatment. Classification can work with data from several layers and can, in contrast to demultiplexing, be stateful. Looking up a packet among several flows might require a partial match search using a ternary CAM (TCAM).
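The three static flows above can be expressed as a small classifier. This is only a sketch: the dictionary field names are made up for readability, and a real implementation would parse the raw frame bytes instead.

```python
# Protocol numbers carried in the IP header's protocol field.
IPPROTO_TCP, IPPROTO_UDP = 6, 17

def classify(frame):
    """Map a parsed frame to one of the three static flows above."""
    if frame.get("ethertype") == 0x0800:        # frame carries an IP datagram
        if frame.get("ip_proto") == IPPROTO_TCP:
            return "ip-tcp"
        if frame.get("ip_proto") == IPPROTO_UDP:
            return "ip-udp"
    return "other"

print(classify({"ethertype": 0x0800, "ip_proto": 6}))   # ip-tcp
print(classify({"ethertype": 0x0800, "ip_proto": 17}))  # ip-udp
print(classify({"ethertype": 0x0806}))                  # other (e.g. an ARP frame)
```

Note that the classifier reads fields from two layers (the Ethernet type and the IP protocol field), which is exactly what distinguishes classification from per-layer demultiplexing.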
Queuing and packet discard
Packet processing is characterised as store-and-forward, since packets are normally stored in memory while they are waiting to be processed. This is called queuing and can be more or less complicated. A simple example is a standard FIFO, which guarantees that packets are processed in the order they arrived, but for more advanced situations, it might be necessary to introduce priorities among the queues to allow packets from a certain flow to be processed more often than those from the remaining flows.
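A minimal sketch of such priority queuing, using strict priority between per-flow FIFOs (the flow names are invented for the example):

```python
from collections import deque

class PriorityQueues:
    """Strict-priority scheduler: always drain higher-priority queues first."""

    def __init__(self, levels):
        self.queues = [deque() for _ in range(levels)]

    def enqueue(self, packet, priority):
        self.queues[priority].append(packet)

    def dequeue(self):
        for q in self.queues:          # queue 0 has the highest priority
            if q:
                return q.popleft()
        return None                    # all queues empty

pq = PriorityQueues(2)
pq.enqueue("bulk-1", 1)
pq.enqueue("voice-1", 0)
pq.enqueue("bulk-2", 1)
print(pq.dequeue(), pq.dequeue(), pq.dequeue())  # voice-1 bulk-1 bulk-2
```

Strict priority can starve the low-priority queue under load, which is why practical schedulers often use weighted variants instead.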
Security: Authentication and privacy
In some protocols, authentication and privacy, which both rely on encryption, are provided, and network systems will in some cases have to process these packets. The additional processing required for authentication or privacy covers the entire packet and is very computationally intensive, so if high performance is essential, dedicated hardware is required.
Traffic measurement and policing
Traffic analysers perform traffic measurement to gather statistical information about what type of data flows through the system. This requires that all frames are analysed, and that their contents are also analysed to examine upper layer header items and payloads. This information might be used for billing or similar purposes.
Traffic policing is similar, but is used to restrict access in some way. For example, for a customer who has bought a connection with limited bandwidth, traffic policing is used to drop packets that exceed this rate.
Traffic shaping
Traffic shaping is similar to traffic policing but is used to enforce softer limits, attempting to smooth traffic without dropping too many packets. This often requires large buffers and good timer management.
A Typical Network Processor System
Figure 2.2: The IXP1200 architecture, simplified (SDRAM controller, PCI controller, scratchpad SRAM, SRAM controller and more).
The title is misleading: there are no typical network systems that can be representative of all protocol processing. But in recent years, some vendors have created architectures that are composed quite similarly. The Intel IXP architecture, one of Intel's recent network processor product families, is one example that will now be studied briefly. A simplified overview can be found in figure 2.2.
Fast path and slow path
The IXP contains both control plane and data plane processing, which can be called the slow path
and fast path processing.
Host processor
Traditionally, there is at least one embedded RISC processor, or another type of GPP, that
takes care of some slow path processing, higher protocols and other administrative tasks, such as
handling exceptions, updating routing tables and similar. In the IXP network processor, this is a
StrongARM core.
Packet processors
For the fast path, there are a number of small packet processors that have a specialised
instruction set, which is more limited than that of a traditional RISC. In the IXP network processors,
these are called microengines and have multithreading support for four threads, which all share the
same program memory. The special instructions of these ASIPs are optimised for packet processing,
and the processors can run at a high clock frequency. The processors do not run an operating system.
Coprocessors and other functional units
To offload some of the tasks that require high computational power, there are several coprocessors
on the chip. They perform common tasks in protocol processing, such as computing checksums,
looking up values in tables (CAMs) and more. Other tasks, such as timers, can also
be provided by coprocessors to make life easier for the programmer.
The packet processors have access to both private, fast, small memories and shared,
larger memories. A large off-chip SDRAM, a faster external SRAM, internal scratchpad memories
and local register files for all microengines can be used for storing data. Each of these memories
is under full control of the programmer, and there is no hardware support for data caching,
except for the StrongARM. There are also a number of transmit and receive FIFOs to store packet
information in.
Multithreading is a processor feature that reduces the inefficiency caused by long instruction latencies and
other factors that prevent the processor from fully utilising its resources[15]. When, for example,
a thread is blocked due to a multi-cycle operation, the processor can perform a context switch and
resume the execution of another thread. If this is handled fast enough, the amount of time the
processor spends idle waiting for an internal or external event is reduced, making it more efficient.
In the past, multithreading was considered too expensive a technique for achieving higher
performance, because the hardware changes it required were too large.
In recent years, this view has changed for the following reasons.
The first reason is the ever increasing memory gap. Over time, processor speeds have
steadily increased between generations, and since processors can perform computations faster, they
also need data faster. The problem is that memory speeds have not increased at the same
pace. The result is that the processor has to wait for the memory
to complete its request, spending cycles waiting when it could instead
perform useful computation. Multiprocessor systems that communicate with a
shared remote memory suffer particularly from this problem. Remote memories, often off-chip, are generally large,
which results in a longer access time. Shared memories can often only serve one request at a
time, which can lead to collisions that require requests to be queued, delaying them even further.
Another reason is the increased use of accelerators, or dedicated coprocessors, that perform
some computation for a host processor. In high performance systems, it is common for a host
processor to off-load some of its time-consuming and common tasks to dedicated hardware that
can perform the computation much faster. During this time, the processor generally will have to
wait until the coprocessor has finished its computation.
Context Switch Overhead
The context switch overhead is the time it takes to switch from one thread to another. This
often implies replacing the processor's internal state, such as registers, flags, the program counter and
other data associated with a certain thread. The context switch overhead depends on
the technique used to save the context and on how much state must be saved. If the context
switch overhead is larger than the latency that triggered the context switch, processor
performance will of course degrade. For short latencies, a fast switch technique is required, which
often demands quite complex hardware.
These are some techniques that are common in multithreaded solutions:
• All contexts in memory and no hardware support
• All contexts in memory but with hardware support
• All contexts kept in dedicated hardware registers
• Some contexts kept in hardware and some in memory
All contexts kept in memory and no hardware support
This is often the case for common general purpose processors (GPP) which have software running
on them that takes care of the context switching. In most cases, an operating system is responsible
for changing the currently active thread, and this is often performed by saving all registers in a
thread to one location in the memory, and then loading another thread’s registers from another
location in the memory. There is little or no additional hardware required for this, but the switch
overhead will be high. It depends on the number of registers in the processor, the memory
access time and how efficiently the code can save the registers of the old thread and load the
registers of the new thread. This normally takes a few hundred cycles, so short
latencies (on-chip memory or fast coprocessor accesses) cannot be hidden using this technique.
Operating systems normally handle multithreading by switching between the different threads
at a fixed interval, called a time slice. The time slice is often very small, so that the user doesn’t
notice this effect and thinks that all threads are running at the same time. Using different priorities
among the threads, the length of the time slice can be varied so that a certain thread will get
more cpu time than another. This technique is called preemptive multithreading and differs from
cooperative multithreading, which was more common earlier. Cooperative multithreading relied on
each program voluntarily telling the OS when it was finished. If a program did not behave as
expected, this could lead to dangerous results, but the complexity of the system was smaller. For
hard real-time systems, this is still used when the tasks can guarantee that they will finish in their
designated timeslots, since the complete execution will then be deterministic (static scheduling).
All contexts are kept in memory but with hardware support
In this case, dedicated hardware takes care of storing and loading the registers' contents from a
high performance memory. The hardware cost is fairly low, and the context switch overhead
depends on the number of registers and how fast the memory is. An estimate is around tens
of cycles.
All contexts are kept in hardware
This requires that there are multiple copies of the register file, flags, program counter and all other
internal states. This can lead to very fast switch times since no data has to be saved or loaded
from memory. To be able to switch in one (or zero, depending on how you count) cycles, copies
of the pipeline also have to be kept for every thread. If the pipeline is not copied, it
has to be flushed whenever a context switch happens, which results in as many cycles lost as
the number of pipeline stages before the execute stage. This technique, especially if the
pipeline is saved, requires quite a lot of additional hardware, but it allows very fast context switches.
No software support is required (although software can still be used), but the hardware logic for the
context switching will be fairly advanced.
Some contexts kept in hardware and some in a fast memory
This can be a reasonable trade-off if the number of threads to run is high. In this
case, some threads are kept in hardware and some in memory. With the help of dedicated
hardware, a context that is not currently executing can be replaced by another thread while the
processor is executing. This can lead to a fast context switch, but in the worst case the delay will
be high. The hardware required is moderate compared to keeping all contexts in hardware,
and the performance can vary between moderate and good depending on how much effort is spent
on preloading and on what the target application looks like.
Multithreading Techniques
Explicit multithreading techniques are divided into two groups[1]: those that issue instructions from
a single thread every cycle and those that issue from multiple threads every cycle, as can be seen
in figure 2.3.
Figure 2.3: Multithreading techniques: issuing from a single thread per cycle (interleaved multithreading, IMT; blocked multithreading, BMT) or from multiple threads per cycle (simultaneous multithreading, SMT).
The techniques that only issue from one thread per cycle are most efficient when applied to
RISC or VLIW processors and are further divided into two groups:
• Interleaved multithreading (IMT). In every cycle, a new instruction is fetched from a thread
different from the one currently running.
• Blocked multithreading (BMT): A specific thread is running until an event occurs that forces
the processor to wait for the results of the event, for example latencies due to memory
accesses. When this happens, another thread is invoked during a context switch.
Interleaved multithreading
Figure 2.4: Interleaved multithreading handling a long latency.
Interleaved multithreading, or fine-grained multithreading as it is also called, means that the
processor performs a context switch on every cycle. One gain of doing this is that control and
data dependencies between instructions in the pipeline can be eliminated[17]. This can remove
a lot of hazard-reduction processor logic, such as the forwarding logic that resolves true data dependencies.
Figure 2.5: Interleaved multithreading eliminating branch penalties.
As can be seen in figure 2.5, branch penalties due to mispredicted target addresses can also
be avoided since the processor does not need to fetch instructions from the same thread until the
branch condition has already been evaluated. In the figure, the instruction ’A’ is a conditional
jump whose condition can not be determined until at the execute stage (EX). It has been fetched
at cycle 1, but at the next cycle, an instruction from another thread will be fetched. At cycle 3,
the ’A’ instruction will be executed, and now the jump destination is known. At cycle 5, the next
instruction from the first thread will be fetched, and since we know exactly which instruction to
fetch, there will be no branch penalty.
This technique requires at least as many threads as there are pipeline stages in the processor. By
not issuing instructions from a thread that is blocked on a long-latency instruction, longer
latencies can also be hidden, as seen in figure 2.4.
Blocked Multithreading
Blocked multithreading, or coarse-grained multithreading, means that the processor executes
instructions from a thread until an event occurs that causes a latency, thus forcing the processor to be
idle. This triggers a context switch, resulting in the processor executing cycles from another thread
instead. Compared to IMT, fewer threads are needed and a thread can continue at
full speed until it gets interrupted[17]. An example can be seen in figure 2.6.
Figure 2.6: Blocked multithreading with a context switch overhead of one cycle (two threads interleaving memory and coprocessor accesses).
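The behaviour in figure 2.6 can be made concrete with a toy blocked-multithreading model. Every number and name below is invented, and the model assumes that latencies overlap perfectly and that there is no contention for the memory or coprocessor:

```python
def bmt_utilisation(n_threads, compute, latency, switch_cost, accesses):
    """Fraction of cycles spent computing when each thread repeatedly
    runs `compute` busy cycles and then blocks for `latency` cycles,
    paying `switch_cost` cycles per context switch."""
    ready_at = [0] * n_threads   # cycle at which each thread is runnable
    cycle = busy = 0
    for _ in range(accesses):
        # Run the thread that becomes ready first (ties broken by index).
        t = min(range(n_threads), key=lambda i: ready_at[i])
        start = max(cycle, ready_at[t])        # idle gap if nobody is ready
        busy += compute
        ready_at[t] = start + compute + latency  # thread blocks after computing
        cycle = start + compute + switch_cost    # pay the switch overhead
    return busy / cycle

# One thread sits idle during every 20-cycle access, while three threads
# with a 1-cycle switch overhead keep the processor almost fully busy.
print(round(bmt_utilisation(1, 10, 20, 1, 100), 2))
print(round(bmt_utilisation(3, 10, 20, 1, 100), 2))
```

With enough threads the utilisation approaches compute / (compute + switch_cost), which is why the context switch overhead matters so much for short compute bursts.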
Figure 2.7: Blocked multithreading models: static (explicit switch, implicit switch) and dynamic (switch on use, switch on cache miss, switch on signal, conditional switch).
Blocked multithreading is classified by the event that triggers a context switch, and they can
be divided as seen in figure 2.7.
Static models
In this case, a context switch is invoked by an instruction. This gives the programmer
full control of the context switching, and if the instruction format is simple, these instructions can
be identified very early in the pipeline. This leads to a low overhead, since the processor can
fetch instructions from another thread on the next cycle, and the pipeline does not have
to be flushed to remove old instructions from the previous thread. There are two cases of static
models:
• Explicit switch. In this case, there is a specific switch instruction that forces the processor to
perform a context switch. Since this instruction does not do anything else useful, the overhead
will be one additional cycle if it is detected at the first pipeline stage, and more if it is
detected later.
• Implicit switch. In this case, the instruction does something useful, for example
loads from a memory or branches to a target address, but also performs a context switch
so that the next instruction fetched will be from another thread. "Switch on branch" can avoid
branch prediction and speculative execution if the instruction is identified soon enough in the
pipeline. This results in an additional overhead of zero cycles if it is detected at the
first pipeline stage, but there are some drawbacks as well. For example, when using "switch
on load", there will be many switches throughout the execution of the program, which
requires a very fast context switch - preferably in zero cycles. Some architectures have tried
to reduce this and will only switch when communicating with an external memory, not
when reading from a local memory.
Dynamic models
Dynamic models are the cases where a context switch is triggered by a dynamic event. In these
models, the decision to perform a context switch is made later in the pipeline, which
either requires a pipeline flush or multiple copies of the pipeline, as discussed on page 16.
Since the context switch triggering is dynamic and handled by the processor, the programmer
does not have to consider it when programming, and unnecessary or very frequent context switches
can be avoided since the processor has more knowledge of its current internal state
and resources.
• Switch on cache miss. This model switches to another thread if a load or store instruction
resulted in a cache miss. The gain of this method is that a context switch will only occur
when a long latency is expected, but it can add overhead since checking the cache memory
takes some time.
• Switch on signal. This model switches to another thread when for example an interrupt, trap
or message arrives which often is triggered by an external event.
• Switch on use. The switch on use model switches when an instruction tries to use a value that
has not yet been fetched from memory (for example due to a cache miss) or is otherwise unavailable.
A compiler or programmer with knowledge of this can take advantage of it and
try to load requested values as early as possible before they are needed. To implement
this, a "valid" bit is added to each register; it is cleared on a memory load and set
when the value is available.
• Conditional switch. This leads to a context switch only when a condition is fulfilled. The
condition can, for example, be defined as whether a group of load/store instructions resulted in any
cache miss. If all load/store instructions resulted in cache hits, the thread continues its
execution, but if any resulted in a cache miss, the processor performs a context switch after
the last load/store instruction. When control returns to that thread, all values that were
requested prior to the context switch are hopefully available.
Figure 2.8 shows possible places for a context switch using three different models.
load r1, [mem]
load r2, [mem]
load r3, [mem]
add r4, r1, r2
load r5, [mem]
add r6, r3, r5
load r7, [mem]
add r8, r7, r7
add r9, r4, r6
mul r10, r8, r9
Figure 2.8: Possible context switch points for three different BMT models (switch on load, explicit switch and switch on use) over the instruction sequence above.
Multithreaded Multi-issue Processors
Combining multithreading with multi-issue processors such as superscalar or VLIW processors can
also be a very efficient design solution[16].
The problem with multi-issue processors when it comes to efficiency is that they are limited
by instruction dependencies (i.e. the instruction level parallelism, ILP) and long-latency
operations. The effects are called horizontal waste (issue slots not being filled because of
low ILP) and vertical waste[16] (cycles lost due to e.g. long-latency operations). This can be seen in figure 2.9.
Figure 2.9: Horizontal waste (issue slots unfilled due to low ILP; total: 11 slots) and vertical waste (cycles lost to long latencies; total: 15 slots).
Multithreading, when used on single-issue processors, can only attack vertical waste (since
there is no horizontal waste on single-issue processors), while on multi-issue processors, both
vertical and horizontal waste can be reduced.
There are a number of possible design choices for multithreading together with multi-issue
processors:
• Fine-grain multithreading, where only one thread can issue instructions every given cycle, but
can use the entire issue-width of the processor. This is normal multithreading as described
earlier and will effectively reduce vertical waste but not horizontal waste.
• Simultaneous multithreading with full simultaneous issue. This is the least realistic model.
Simultaneous multithreading works so that all hardware threads are active simultaneously
and competing for access to all hardware resources available. When one thread has filled
its issue slots for a certain cycle, the next thread can fill the remaining slots. This will be
repeated for all threads until there are no more issue slots available (i.e. they have all been
filled), or there are no more threads. The order in how the threads are allowed to fill the
slots can be decided by different priorities among the threads, or cycled using for example
round robin scheduling to result in a fair distribution.
• Simultaneous multithreading with single issue, dual issue or four issue. In these cases, the
number of instructions every thread can issue is limited. With dual issue, a thread can issue
two instructions per cycle, so filling an eight-slot processor requires at least four threads.
• Simultaneous multithreading with limited connection. In this case, a hardware resource can
process instructions from a limited number of threads. For example, if there are eight threads
and four integer units, every unit is connected to two threads. The functional units are still
shared, but the complexity is reduced.
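The slot-filling idea behind full simultaneous issue can be sketched as follows. The function name and numbers are invented: each cycle, up to `issue_width` slots are filled by visiting the threads in round-robin order, each thread contributing at most the instructions it has ready.

```python
def fill_slots(ready_per_thread, issue_width, start=0):
    """One cycle of full simultaneous issue: fill slots round-robin,
    beginning with thread `start`, until the issue width is exhausted."""
    n = len(ready_per_thread)
    issued = [0] * n
    slots = issue_width
    for k in range(n):
        t = (start + k) % n
        take = min(ready_per_thread[t], slots)  # thread fills what it can
        issued[t] = take
        slots -= take
        if slots == 0:                          # issue width exhausted
            break
    return issued

# Thread 0 has 3 ready instructions, thread 1 has 4 and thread 2 has 2;
# an 8-wide machine starting at thread 0 issues 3 + 4 + 1 = 8 of them.
assert fill_slots([3, 4, 2], 8) == [3, 4, 1]
```

Rotating `start` between cycles gives the fair round-robin distribution mentioned above; a priority scheme would instead visit the threads in a fixed order.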
Using simultaneous multithreading, not only can cycles lost to latencies be
filled with instructions from other threads, but also unused issue slots in every cycle, thus
exploiting both instruction level parallelism (ILP) and thread level parallelism (TLP), as can be seen
in figure 2.10. A well-known implementation of this is the Hyper-Threading feature of modern
Pentium 4 processors.
Figure 2.10: Simultaneous multithreading (several threads sharing the issue width each cycle).
Multiprocessor Systems
Multithreading can be seen as a type of parallel hardware that exploits program parallelism by
overlapping the execution of multiple threads. Multiprocessors and multithreading both exploit
parallelism in the application code, but they do so somewhat differently: multiprocessors execute
multiple threads in parallel on multiple hardware resources, such as processing elements and caches,
while multithreading executes multiple threads on the same hardware resources. The performance
of a multiprocessor system will always exceed that of multithreading with as many threads as
processors, but this is not a fair comparison, since the hardware cost of the multiprocessor system
is larger - everything must be replicated (functional units, decoder logic, caches etc). Multithreading,
on the other hand, overlaps execution instead of performing it in parallel, which only requires a part
of the processor to grow with the number of threads.
Communication between processors is also a non-trivial problem compared to the multithreaded
case. Multiprocessor systems have to synchronise threads and communicate
using a bus or a shared resource, such as an external memory. There is no simple way of
signalling between them, except hard-wired interrupts. Multithreaded systems can communicate
faster and more easily using internally shared resources, such as internal memories or shared registers,
and internal exceptions can be used for signalling.
Analytically Studying Multithreading
It would be an advantage if it were possible to analytically determine the gains of multithreading
for network processors. This could, for example, be used for guaranteeing a certain line rate when
a worst-case execution time is known, or for making an initial estimate of whether multithreading
is worth investigating further. Unfortunately, the complexity of real-world systems makes it
practically impossible to come up with an exact answer. The uncertainty of the surrounding system,
such as the packet distribution, complicates this further, which forces us to simplify the description
and constraints of the systems.
One possible estimation is to statistically calculate how much utilisation a system will have
given different multithreading aspects, as described in [14].
Given the total number of cycles spent on memory and coprocessor accesses over the entire program,
MemC and CopC, the total amount of time a thread has to wait is, under the
assumption that the resource is always available (a significant limitation):
K = MemC + CopC
If the total number of cycles required to execute the program is T (including the time the
thread will have to wait for an external resource), the probability p that a thread is waiting for a
latency source is:
p = K/T
If the processor has n threads, the probability that all of them are waiting at a given time is p^n,
under the assumption that there are no thread dependencies. In other words, the probability that
not all threads are waiting, i.e. that at least one thread on the processor is performing something,
is 1 - p^n.
When a thread is performing something, it can either do something useful (computation) or
perform a context switch, which takes time. The probability that a given thread is in the switch
state, that is the probability that it is blocked times the probability that any other thread can run,
is p · (1 - p^n). Using this, the probability q that the processor is in a thread-switch state can be
derived, given that the time to perform a thread switch is C and the total number of latency
sources is L:
q = (L · C)/K · p · (1 - p^n) = (L · C)/T · (1 - p^n)
The processor efficiency E, that is the fraction of cycles in which the processor is doing something
useful, can then be expressed as:
E = (1 - p^n) - q = 1 - p^n - (L · C)/T · (1 - p^n)
  = (1 - p^n)(1 - (L · C)/T)
  = (1 - (K/T)^n)(1 - (L · C)/T)
This will be an optimistic approximation, since thread and resource dependencies are neglected,
as described earlier. The characteristics of the latency resource are also omitted, which can have
a great impact on the real utilisation of the system. However, it is a good estimate considering
how little information has to be provided.
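A quick numerical sketch of the estimate above, with all figures invented: a program of T = 1000 cycles spends MemC = 200 cycles on memory accesses and CopC = 100 on coprocessor accesses over L = 30 latency events, with a context switch cost of C = 2 cycles.

```python
def efficiency(n, T, MemC, CopC, L, C):
    """Estimated processor efficiency E = (1 - (K/T)^n)(1 - L*C/T)."""
    K = MemC + CopC          # total waiting cycles per thread
    p = K / T                # probability that a thread is waiting
    return (1 - p**n) * (1 - L * C / T)

for n in (1, 2, 4, 8):
    print(n, round(efficiency(n, 1000, 200, 100, 30, 2), 3))
```

With these numbers the estimate rises from about 0.66 for a single thread towards the ceiling of 1 - L·C/T = 0.94 as more threads are added, illustrating the diminishing returns of extra contexts.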
Another approach was taken in [9], where the worst-case execution time was analytically
computed using integer linear programming. It requires a thorough investigation of how the program
works and has some major limitations that are needed to make the problem mathematically solvable.
There are limitations on the application (no recursion or unbounded loops are allowed, but these are
not common anyway), the scheduling must be strict round-robin, which is also very common, and
only one latency source, shared by all threads, is supported. The computed WCET (Worst
Case Execution Time) figures differ from simulated WCET by 40-400%, depending on the
application.
As shown above, it is only possible to analytically describe very small systems, and then with many
limitations, which leads to the conclusion that, for now, simulations are still the fastest way to
estimate the impact of multithreading on large systems.
The Downside
When multiple threads are running on the same processor, memory accesses will be more frequent
and the number of cache accesses will also increase. When the threads are running in different
portions of the code, many cache accesses will result in cache misses and this will reduce the
performance of the system.
A simple solution for the instruction cache is to have a private cache per thread. This
can also be considered for data caches, but it imposes a few problems. If the same process is
parallelised over multiple threads on the same processor, the memory accesses will often occur in
the same memory area. With private caches, it is important to maintain cache coherency so that
no stale data is present. False sharing can happen, as in multiprocessor systems, unless all caches
are updated (either flushed or updated with the new value, which has to be propagated
to all caches) when writing to one cache memory. Since this is difficult to implement efficiently,
private caches are generally not used in multithreaded architectures[4].
An intuitive solution is instead to have a shared cache for all threads in the processor. This,
however, often leads to an increased cache miss rate, especially for threads running different
processes that operate in different memory areas but whose locations in the cache
memory are the same and therefore collide. Since the increased cache miss rate is the largest
downside of multithreaded systems compared to multiprocessor systems, the cache size
and implementation can be changed to improve performance. Comparing a multithreaded processor with n
threads and m kB of cache memory to a multiprocessor system with n processors, each with m kB of
cache memory, is unfair. [3] and [15] have instead increased the cache memory
of the multithreaded processor to n · m kB and compared it to the multiprocessor system. Both
systems then have the same total amount of cache memory, and in most cases this
led to the multithreaded system having 33%-94% fewer cache misses on
average. This is very application dependent, and in some cases where the cache miss rate
was already low, it did not lead to large improvements. In some cases, the performance even
degraded.
Another solution is to increase the set-associativity to n while keeping the total
cache memory size at n · m. According to [3], this leads to a 3.5%-88% reduction in cache misses
compared to the previous solution with the same cache memory size.
Other Techniques for Hiding Latencies
Memory latencies have traditionally been hidden by caches, and a lot of research has been done
in this area on cache schemes and sizes. Caches, however, do not actually hide memory
latencies[1]; instead they try to eliminate as many of the long latencies as possible and to
minimise the remaining memory accesses that do not result in a cache hit. If the code has low
temporal and spatial locality, it will benefit little from caching, and this technique is then
no longer usable.
If the program knows that it will use data stored in a given location in the future, it
might be a good idea to introduce memory prefetching to the architecture. If the data is
prefetched long enough before it has to be used, it will have been fetched from the memory
network in time for the processor to use it, resulting in no penalty. This is particularly
useful for complex interconnection networks with high latencies. For less regular code, such as
traversal of data structures (trees, linked lists etc), the location cannot be predicted soon enough
and prefetching will be impossible.
Multithreading for Network Processors
Assigning what each thread does on a network processor can be done in numerous ways - each
with its advantages and disadvantages.
• One thread for each layer
• One thread for each protocol
• Multiple threads for each protocol
• Threads for protocols plus management tasks
• One thread for each packet
One thread for each layer
Dividing the protocol processing into separate threads for each layer has a few advantages.
First of all, the code will be smaller and simpler, but since all protocols of a certain layer
have to be covered, the code will still be large for multiprotocol systems. It is also possible to assign
different priorities to different layers, so that lower layer protocols, e.g. layer two, have higher
priority than higher layer protocols, e.g. layer three.
A disadvantage is that each thread must handle both incoming and outgoing packets which
will complicate the code. The way packets are passed between the layers, i.e. threads, will also
introduce a lot of overhead. A packet will normally (depending on the application of course) be
processed in many layers, and when the packet is handed over from one thread to another, the
threads will either have to be synchronised, or buffers will have to be put between them. In either
case, this will take some additional time.
One thread per protocol
To make the code even smaller and simpler, it is possible to give each protocol its
own thread. The code will be easier to understand, and it will be possible to assign priorities so
that UDP processing has a higher priority than TCP processing, since UDP packets normally
require lower latencies, e.g. for VoIP (Voice over IP) or similar. The disadvantages of the previous
approach are also applicable here.
Multiple threads per protocol
This is often used to split up the processing of a protocol into different directions. The designer
can then assign higher priorities to outgoing packets for a system to avoid congestion.
Protocol processing plus management tasks
There are some common time-critical operations that have to be performed, especially for higher
layer protocols, such as retransmission timers, reassembly and route updates. A dedicated
thread for these operations can significantly simplify the design and programming, and
collecting all timer management for all layers in a special thread might simplify the design even
further. But since the rate of timer events differs greatly among the protocols, ranging from
minutes between router updates to seconds between packet timeouts, this might be difficult.
One thread per packet
The disadvantages described earlier for layer threads and especially the overhead from thread
switching when passing packets between layers or protocols can be avoided by partitioning the
threads in another way. If a certain thread works with a packet for its entire lifetime in the system,
the overheads are avoided, but it also requires all protocol processing code for every layer and
every protocol to be available by the packet threads. This is probably the most common technique
in network processors.
System Level Methodology
Modern embedded systems, such as signal and network processing systems, are becoming
increasingly complex, and together with future unknown requirements, programmability is an unavoidable
requirement[8]. However, performance, cost and power requirements are also very important, which
implies that parts of the solution must be made with dedicated hardware components that offer
better performance and power figures, but less flexibility. This leads to more and more systems being
heterogeneous, i.e. consisting of both programmable and dedicated components. The increasing
complexity of these systems requires that tools suitable for modelling, simulation and benchmarking
are available to cut down the development time. The less time that has to be spent at the
early (system) levels, the shorter the time to market.
Application modelling
In the beginning of a new system development, detailed knowledge of the application or the architecture is rare. No compilers or simulators exist, making it difficult to benchmark a program,
and in many cases no definitive input data or application exists either, making it impossible to generate a program trace that can be used for simulation. When an application did exist prior to the
development of the system, the system was traditionally constructed around that given application.
The result is a system with very good performance for the given application, but
if it changes later on, or more applications are introduced on the same architecture, the
new performance results can be a lot worse, or the system even unusable.
Architecture modelling
In the past, embedded system developers have almost exclusively worked with VHDL models, which only
give a few abstraction levels to explore and limited design opportunities[18]. In order to cut down
time-to-market, an exploration methodology should be used.
Many system level design exploration tools make it possible to efficiently explore different architectures
by starting with an abstract, yet executable, model and iteratively refining it to find
optimal solutions. The tools often make it possible to test many different applications on a given
architecture, and also to test the applications on different architectures, which often are of a heterogeneous nature, without having to completely rewrite either the architectures or the applications.
Design Space Exploration Models
Y-chart Model
Figure 2.11: The Y-chart model for design space exploration.
At system level, a common and well proven methodology is the Y-chart approach[7][18], which
allows reusability for efficient design space exploration. In this model, the architecture and applications are modelled separately, and multiple target applications can be mapped, one after another,
onto the available architecture components. The result of the performance analysis should be
used to refine the architecture, the applications and the mapping to fulfil the requirements of the system.
The Y-chart model can be seen in figure 2.11.
Design Space Exploration Pyramid
Figure 2.12: The design space exploration pyramid (vertical axes: abstraction level and cost of modelling (time), ranging from back-of-the-envelope estimates down to abstract executable models; the base spans the design space of alternative realisations).
Another model is the design space exploration pyramid[18], which shows how you can iteratively
explore the design space and find an optimal solution. It can be seen in figure 2.12. High up
in the modelling abstraction, the costs are small (it takes little time to model a system) and the
design space is large. After iterative tests, going further down the pyramid to lower abstraction levels,
the design space is narrowed and finally an optimal solution should be found.
Area Efficiency
Chip Area Estimation
The chip area equation for a large network processor is expressed in [5] as:

area_np = s(io) + Σ_{j=1}^{m} [ s(mch_j) + Σ_k ( s(p_{j,k}, t) + s(ci_{j,k}) + s(cd_{j,k}) ) ]

Where s(io) is the area for I/O (common for all processors), m is the number of processor
clusters, s(mch) is the area of a memory channel, s(p) is the size of the processors, and s(ci) and
s(cd) are the sizes of the instruction and data caches. Only the largest area contributions are described.
For a multithreaded protocol processor, the area can be divided into two parts; one part of
the area depends on the number of hardware contexts and one part is independent of it:
• The base processor logic, such as processor control, processing units, branch prediction etc.,
is constant and independent of the number of threads. This part can be called pbasis .
• The second component is the hardware contexts (all registers, flags and similar) and the
other logic associated with a thread, called pthread . For a small number of threads, and for a
simple thread scheduling and control policy, this can be approximated as increasing linearly
with the number of threads. This will be inaccurate for exotic solutions such as multi-issue
multithreaded processors, but in the general case this is not an issue.
This leads us to the simplified description of one (1) protocol processor:
s(t) = s(pbasis ) + t · s(pthread )
Where s() is an area function and t is the number of threads. For the Infineon PP32 and the ARM7,
the two area components have been acquired, but due to the confidential nature of these numbers,
they can unfortunately not be published here.
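The area model above can be sketched as a small function. Since the real PP32 and ARM7 component areas are confidential, the numeric values below are hypothetical placeholders, used only to show how the model behaves:

```python
def processor_area(s_base, s_thread, t):
    """Area of one multithreaded processor: a thread-independent base
    (p_basis) plus a per-context cost (p_thread) that grows linearly
    with the number of threads t, valid for small thread counts."""
    return s_base + t * s_thread

# Hypothetical area units: base logic 100, one hardware context 15.
areas = [processor_area(100, 15, t) for t in (1, 2, 4, 8)]
```

Doubling the thread count from 4 to 8 here grows the area by under 40 %, which is why multithreading can be cheaper than adding whole processors.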
Defining Area Efficiency
For on-chip solutions in embedded systems, best raw performance is not always the final goal of a
system. Due to the limited area on the chip, the final solution must be fast relative to the area
it consumes. With a good performance per area, it will, to some extent, be possible to later scale up the
performance by running multiple systems in parallel. We call this number the
area efficiency, and maximising it for a variable number of threads will be the goal of the
first simulation later on.
One possibility is to define the area efficiency as IPS per area unit, and if the processor utilisation
is ρp , this area efficiency can then be defined as:

λae = (ρp · clkp ) / s(p, t)
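Maximising λae over the thread count can be illustrated as follows. The saturating utilisation curve used here is a made-up stand-in (each extra thread hiding half of the remaining stall time), not simulator output, and the area constants are the same hypothetical values as before:

```python
def area_efficiency(util, clk_hz, area):
    """lambda_ae = rho_p * clk_p / s(p, t): instructions per second
    delivered per area unit."""
    return util * clk_hz / area

def utilisation(t, stall_fraction=0.5):
    # Hypothetical model: a single thread stalls half the time, and
    # each additional thread hides half of the remaining stall time.
    return 1.0 - stall_fraction ** t

def best_thread_count(s_base, s_thread, clk_hz, max_t=16):
    """Thread count that maximises area efficiency under this model."""
    return max(range(1, max_t + 1),
               key=lambda t: area_efficiency(utilisation(t), clk_hz,
                                             s_base + t * s_thread))
```

With these numbers the optimum lands at a small thread count: utilisation saturates quickly while the context area keeps growing linearly, so beyond a few threads each added context costs more area than it recovers in utilisation.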
Chapter 3
Previous Work
SPADE[8] (System level Performance Analysis and Design space Evaluation) is a tool that can
be used for architecture exploration of heterogeneous signal processing systems and follows the
Y-chart paradigm, which represents a general scheme for design of heterogeneous systems.
The application, which is described in C or C++, is transformed into a deterministic Kahn
Process Network (KPN) by structuring the code using YAPI, which is a simple API. The designer
must identify computational blocks and surround them with the API functions; examples of blocks
are DCT and other well-known signal processing algorithms. The application
is then run on real-life data, and a trace is generated that represents the communication and
computational workload. SPADE then uses that trace in its simulation, since it uses a trace-driven
simulation technique.
In the architecture model, you have to estimate how long the computational blocks (applications) require on the architecture you model. The architecture file contains all processors
you can select from and the time every task takes on them. You also specify the interfaces between
the blocks, for example whether they are point-to-point or via a shared memory. The bus widths,
buffer sizes and memory latencies must also be defined.
With SPADE, the mapping of the application onto the architecture is performed one-to-one or
many-to-one. If the designer wishes to distribute a computational workload over several computational resources, the designer has to rewrite the application so that the workload
is split into two processes.
StepNP[13] (SysTem-level Exploration Platform for Network Processing) is a tool that can be used
for simulating network processor systems at an early stage. It uses ISA simulators together with
a system framework written in SystemC. Models are available for some common processors,
such as ARM and PowerPC, and multithreading capabilities can be implemented in these models.
While SystemC is of a lower abstraction level, the systems can be simulated at different abstraction levels.
Since it uses ISA simulators, the user must have detailed knowledge of a system, and this
limits how much of the design space can be explored in a short amount of time. The research team is planning to
add support for configurable processor models from e.g. Tensilica or CoWare, but these will still
have to be described in high detail.
NP-Click[12] is a programming model which makes it easier to write code without having to
understand all of the details of the target architecture. Traditionally, a programmer has to
use assembly language or a subset of C when programming a network processor. This low-level
approach requires the programmer to have deep knowledge of all the architecture details to
be able to implement an efficient application without encountering too many problems. NP-Click
adds an abstraction level on top of the Intel IXP1200, which is a network processor with
many protocol processors and other processing engines.
The abstraction layer, which they call a programming model, hides the underlying architecture and only exposes enough details so that efficient code can be written. It tries to create a
bridge from an application model, in this case based on the Click engine, to the architecture. Click is a domain-specific language designed for describing networking applications[10],
which is the abstraction level where the other systems described above work. Applications are
built by composing computational tasks which correspond to common networking operations, like
classification, route table lookups etc.
Some problems can occur when trying to reduce the amount of manual work required by the
programmer. High concurrency together with bounded buffers can sometimes lead to deadlocks,
and for performance reasons, some aspects, such as mapping elements to threads and utilising the
memory, must still be handled by the programmer.
Even though NP-Click works at a lower abstraction level and itself has to be made for a
specific architecture, in this case the IXP1200, it can create a bridge between a high abstraction
level and the underlying hardware. This is one (or a few) steps down the design space exploration
pyramid, as shown on page 30, and it has its origin in a domain-specific language that can be used
to describe a large range of possible design solutions.
None of these systems are publicly available.
Chapter 4
Proposed Solution - EASY
Difference from Previous Work
Since the solutions described in Previous Work on page 33 are not suitable for the requirements,
a new simulator framework was built to make it possible to simulate network processor systems
specialised in multithreading, at a high abstraction level. It is called EASY, which stands for Early
Architecture modelling and Simulation sYstem.
The key differences and features that make EASY suitable are:
• High level coarse grained modelling.
• Designed for multithreading.
• Designed for multiprocessor systems.
• Architecture modelling specialised towards network processor systems.
• Application modelling specialised towards protocol processing.
• Data-independent simulation.
High Level Coarse Grained Modelling
The other simulators are mostly trace-driven simulators that require an already defined ISA (Instruction Set Architecture) and an ISS (Instruction Set Simulator). EASY does not require
this knowledge or these tools, and can because of this simulate a large number of systems that otherwise
would be very difficult to model in detail. The proposed simulator sacrifices some accuracy
for a greater ability for design space exploration. The architecture, both in a macro view (the composition of processors, memories and coprocessors) and in a micro view (how the processors, memories etc.
behave), can be expressed easily, but with less detail, thus requiring no fixed, predefined architecture
or tools.
Designed for Multithreading
Among system-level design exploration simulators, multithreading is a feature that cannot be
found, according to the author's research in the subject. In the proposed simulator, the coprocessors, memories and processors can be described with parameters that affect how multithreading is
handled. This includes specifying how requests to latency sources (such as memories) are handled
and how the context scheduler on a processor behaves.
Designed for Multiprocessor Systems
According to [6], trace-driven simulations become too inaccurate when expanding the system from
single processor systems to multiprocessor systems. The proposed simulator works with program
driven simulations, modelled using microinstructions (described in Appendix A on page 69).
In the proposed simulator, all resources, such as memories and coprocessors, can be shared
among an arbitrary number of processors. Requests to these resources are handled in a way that
resembles a real-world system, including queuing, latencies and throughput. Simple communication
and synchronisation among processors can also be modelled at a high abstraction level.
Architecture Modelling
The architecture is specialised towards network processing. The most common elements in network
processor systems are available, and using these key components, more complex systems can be built.
Application Modelling
The modelling language is specialised towards protocol processing. Network tasks, as described
on page 11, can be described in a straightforward way using microinstructions, as described in
Appendix A on page 69.
Data-independent Simulations
The simulator does not work with actual data, which in this early design phase normally isn't
available to the designer. This decreases the accuracy of the results compared to when the data is available,
but allows the designer to get any results at all while the data is unavailable, which is the target
situation for this system.
A full list of features can be found in Appendix C on page 85.
System Modelling
The model of a system is all the hardware that the entire system is composed of and the software
running on processors. This includes shared memories and coprocessors, processors with private
memories and coprocessors, ports and similar. To reduce the time it takes to describe a system,
Figure 4.1: An overview of the simulator (inputs: architecture file, simulation setup and cost functions).
most resources except advanced coprocessors and processors are defined purely in the architecture file.
A system in EASY is modelled in three steps, or three abstraction levels:
• Defining Resources
• Describing Resources
• Connecting Resources
Defining Resources
In this step, you define how many processors, memories and coprocessors you will need for the task.
This can very easily be changed later, so you will not be limited by this initial choice.
Describing Resources
After you have determined what types of resources you need, you must describe the
behaviour of the resources, for example how they affect other resources. For simple
components, this is the only description of the resource, which makes it easy to test different
options to refine the simulation results. For more advanced resources, such as processors, you also
need to specify the applications that run on these processors. This is described in detail later.
Connecting Resources
This is a large part of the mapping, which is an important input in the Y-chart model described on
page 30. In this part, you have to connect the coprocessors and memories to the processors and
advanced coprocessors in the system. The resources can either be shared among many processors,
or used solely by a single processor. This step, as well as the earlier ones, can very easily be changed to
examine different system setups.
The syntax of the architecture file can be found in Appendix B on page 75.
Memories
Memories are handled in the simulator as pure latency sources. Their operation will not affect
other resources in the simulator, so they are slaves that only serve requests from other resources,
which can be seen as masters. Memories can either be global and used by multiple processors, or
private to a single processor.
Defining read and write operations
Read and write operations are the two basic operations that you can perform on a memory. Should this
prove a limitation, it is possible to add more operations, for example reading multiple words
or testing bits in memory, but for now, only read and write operations are defined. You can define
how long the memory will be blocked for a single read or write operation, but the actual time
a processor is blocked after a request can be longer than this number due to queued requests.
You can also specify whether the processor issuing the request will be blocked completely or if only the
current thread will be blocked. In the latter case, a context switch can occur.
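A minimal sketch of such a memory as a pure latency source follows. The class and parameter names are illustrative, not the simulator's actual implementation:

```python
class Memory:
    """Pure latency source: serves read/write requests, optionally
    blocking only the issuing thread so a context switch can occur."""
    def __init__(self, read_cycles, write_cycles, blocks_thread_only=True):
        self.read_cycles = read_cycles
        self.write_cycles = write_cycles
        self.blocks_thread_only = blocks_thread_only
        self.busy_until = 0  # simulator cycle at which the memory is free

    def request(self, now, op):
        """Return (finish_cycle, thread_only). A queued request starts
        only when the memory becomes free, so the caller may wait longer
        than the raw operation latency."""
        cost = self.read_cycles if op == "read" else self.write_cycles
        start = max(now, self.busy_until)
        self.busy_until = start + cost
        return self.busy_until, self.blocks_thread_only
```

Two reads issued at the same cycle illustrate the queuing effect: the second caller observes twice the nominal read latency.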
Queuing requests
Issuing an operation on a memory results in the memory being busy for a certain amount of
time. In the case when a memory is shared among several processors, the requests can be queued
in many ways.
Queued requests to memories behave in the same way as for coprocessors, as described on
page 40.
Cached Memories
It is also possible to define a cache memory which processes requests in front of the real memory.
In that case, you can define how long a cache-hit read or write operation takes and whether the
processor will be blocked or can perform a context switch during this time. If a request results
in a cache miss, the normal read or write delay is used instead. You can also specify a cache
miss ratio, which is entered as a percentage.
By setting a cached read operation to block the processor for a low number of cycles, and
a non-cached read operation to only block the current thread of a processor, you can simulate a
“context switch on cache miss” as described on page 21.
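The “context switch on cache miss” setup described above can be sketched like this. It is a simplified model with assumed parameter names; the simulator's configuration syntax is described in Appendix B:

```python
import random

class CachedMemory:
    """Cache in front of a memory: a hit costs a few blocking cycles,
    a miss falls through to the full read delay and blocks only the
    current thread, modelling 'context switch on cache miss'."""
    def __init__(self, hit_cycles, miss_cycles, miss_ratio, rng=None):
        self.hit_cycles = hit_cycles
        self.miss_cycles = miss_cycles
        self.miss_ratio = miss_ratio          # e.g. 0.05 for 5 %
        self.rng = rng or random.Random(0)

    def read(self):
        """Return (cycles, switch_thread)."""
        if self.rng.random() < self.miss_ratio:
            return self.miss_cycles, True     # miss: let another thread run
        return self.hit_cycles, False         # hit: cheap, stay on thread
```

Note that the miss ratio is statistical: individual misses are drawn at random, which is exactly the limitation discussed in the next section.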
Cache Memory Limitations
The data cache as well as the instruction cache is extremely dependent on the actual data fetched from the
memory. The data characteristics, in both temporal and spatial locality, affect what is stored
in the cache memory and when and how often a cache miss happens. A certain algorithm might
almost never cause a cache miss while another can cause a lot of cache misses.
In the simulator, we can not specify when a cache miss occurs, but we can specify how often it
occurs, as a percentage. This is often the parameter you measure when you select a cache memory
size to increase the system's performance, and it will always be an average number which you in
most cases can not be sure will hold for the actual application running on your system. Because of
this, cache modelling can only be justified if the person modelling a system is aware of the severe
limits it poses. Whether the results can be used for something useful is case dependent and hard
to answer, but the possibility is there anyway. An accurate number can often only be acquired by
running a bit-true and cycle-true simulator with support for cache memories. Even then, the input
stimuli to the simulator can affect the cache behaviour so much that if you don't have
input stimuli representative of real-life data, the results you get could still be too inaccurate.
The instruction cache misses are completely random, with the single exception that the same
instruction can not get a cache miss two times in a row.
The data cache does not have this exception, since you can not specify a memory location
to read from, so two consecutive memory read accesses can both lead to a cache miss even though
the real code would force the second memory read to hit the cache.
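The instruction-cache rule above (random misses at a given ratio, but the same instruction never missing twice in a row) can be sketched as follows; the class and method names are hypothetical:

```python
import random

class ICacheModel:
    """Random instruction-cache misses at a given ratio, except that
    an instruction that just missed cannot miss again immediately."""
    def __init__(self, miss_ratio, rng=None):
        self.miss_ratio = miss_ratio
        self.rng = rng or random.Random(0)
        self.last_missed = None   # address of the most recent miss

    def fetch_misses(self, addr):
        """True if fetching `addr` misses the instruction cache."""
        if addr == self.last_missed:
            self.last_missed = None
            return False          # just fetched from memory: now cached
        if self.rng.random() < self.miss_ratio:
            self.last_missed = addr
            return True
        return False
```

A data-cache model would drop the `last_missed` check entirely, since no address is tracked for data reads.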
Coprocessors
Coprocessors are another big source of latencies that reduce the efficiency of non-multithreaded
processors. They are often used as accelerators for a host processor, and their purpose is to process
an amount of data faster than the host processor could on its own, for example a CAM lookup
or a CRC engine. They can also be used for actions that the host processor can not perform
at all, for example a high-resolution and accurate timer, or a semaphore engine for several host
processors.
In the simulator, they can be modelled in two different ways: either as pure latency sources
or as advanced processing engines that interact with an environment (talk with memories, use
other coprocessors and more), as described in 4.4.2.
Pure latency source
This is the default way to model a coprocessor and can often be used with fair accuracy. In
this modelling approach, you specify how long an action takes on a coprocessor, for
example how long a lookup takes in a CAM. Since the CAM can be thought of as a
black box with low interaction with the environment (the CAM must get a “key” to search for,
and the processor issuing the request will receive the results), this is a fairly good model. The
positive effects of using this approach are:
• It is easy to model the coprocessor (simple configuration file).
• It is easy to specify the latencies, which reduces the risk of making mistakes while modelling.
• It is easy to have a coprocessor with numerous actions.
• It is possible to have queued requests with good flexibility.
• It is possible to model a pipelined coprocessor that processes several requests in parallel.
• It is possible to model a load-balanced cluster of coprocessors, with some limitations.
Figure 4.2: A pipelined coprocessor that is shared among four processors.
Pipelined coprocessors can be modelled using this approach. If the pipeline length is n, and
the time to finish the entire task is t, then rtr_time = t/n and the max_parallel parameter
should be set to n, as described in B.7. This can be seen in figure 4.2.
Figure 4.3: A load balanced cluster of three coprocessors, with a setup time (minimum time between requests), shared among four processors.
A load balanced cluster of coprocessors can also be modelled. If the number of coprocessors
is n, and the time it takes for a task to be distributed to a coprocessor is t, then you should set
max_parallel to n and rtr_time to t. This can be seen in figure 4.3.
In all these cases, when there are more outstanding requests than max_parallel, the other
processors issuing requests will be blocked until the coprocessor can accept a new request. The
requests will also be served in the order they were issued, which might not be the case in real
life. This, however, should not be a devastating limitation.
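Both figure 4.2 and figure 4.3 reduce to the same two parameters. A sketch of how max_parallel and rtr_time could govern request admission follows; the class interface is illustrative, not the simulator's code:

```python
class Coprocessor:
    """Latency-source coprocessor. A pipelined engine of depth n that
    finishes a task in t cycles uses max_parallel = n, rtr_time = t/n
    (ready-to-receive interval); a load-balanced cluster of n engines
    with setup time t uses max_parallel = n, rtr_time = t."""
    def __init__(self, latency, max_parallel, rtr_time):
        self.latency = latency
        self.max_parallel = max_parallel
        self.rtr_time = rtr_time
        self.next_accept = 0      # earliest cycle a new request is taken
        self.in_flight = []       # finish cycles of accepted requests

    def request(self, now):
        """Return the cycle at which this request's result is ready.
        Requests are served strictly in issue order."""
        self.in_flight = [f for f in self.in_flight if f > now]
        start = max(now, self.next_accept)
        while len(self.in_flight) >= self.max_parallel:
            earliest = min(self.in_flight)    # wait for a slot to free
            self.in_flight.remove(earliest)
            start = max(start, earliest)
        self.next_accept = start + self.rtr_time
        finish = start + self.latency
        self.in_flight.append(finish)
        return finish
```

For a three-stage pipeline with a 30-cycle task (max_parallel = 3, rtr_time = 10), four simultaneous requests finish at cycles 30, 40, 50 and 60: one result per rtr_time after the first, exactly the pipelined behaviour of figure 4.2.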
Advanced processing engine
The second way to model a coprocessor is to see it as a real processor that makes an impact
on the environment in which it is instantiated. In this approach, you must model the
coprocessor's actions yourself using microinstructions. This makes it possible to specify how the
coprocessor communicates with, for example, a memory and other coprocessors, thus giving a more
accurate description of the coprocessor. The positive effects of this approach are:
• You can more accurately model a coprocessor and its effect on the environment
• It might be easier to describe the coprocessor’s actions using a code description
You will be limited in the way the coprocessor can queue up requests, and it is not possible
to model pipelining or similar parallelism, but using semaphores and conditions it is possible to
model multi-action coprocessors that behave in the same way as in the previous approach.
A full description on how to model advanced coprocessors can be found in appendix B.4.2 on
page 77.
Processors
Processors and advanced coprocessors, which are a special type of processor, are the only active
components in the simulator. This means that they can affect passive resources, such as memories
and coprocessors, and also other active components such as processors. Memories can still affect processors, since a memory access requires the processor to wait for a number of cycles that the
memory decides, but the initial request was made by the processor.
Application Code
Processors continuously execute application code when they are running, and the type of code is
described in detail in Appendix A on page 69.
The most interesting feature of the processor description is the native support for multithreading.
An unlimited number of threads is supported, and each thread can run its own application if
wanted. A context switch to another thread can be performed on a large number of events,
such as:
• A coprocessor request that will block a thread
• A memory access that will block a thread
• A semaphore instruction which forces the thread to wait for an external event
• Forced by application code (a switch instruction)
• A port access that will block a thread
• An interrupt event occurring (this is handled somewhat differently, which will be described in
detail later)
• An instruction cache miss that will block a thread
• At any time if the scheduler wishes to switch thread
It is possible to specify, for each and every item in the above list, whether the processor will
perform a context switch or not.
Context switch handling is controlled by a scheduling algorithm that can be specified for every
processor in the system. The default scheduling algorithm is fair round-robin scheduling, which is
the most common algorithm among multithreaded processors; together with a dummy scheduler
that never performs a context switch, these are the predefined scheduling algorithms. It is possible
to add a custom-made scheduling algorithm if wanted, and an interesting implementation
would be IMT (Interleaved Multithreading - see page 19).
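The default fair round-robin scheduler can be sketched as a single function; the interface (a ready flag per hardware context) is a hypothetical simplification:

```python
def round_robin_next(current, ready):
    """Pick the next ready thread after `current`, wrapping around.
    `ready` is one boolean per hardware context; returns None when
    every thread is blocked (the processor then stalls)."""
    n = len(ready)
    for step in range(1, n + 1):
        candidate = (current + step) % n
        if ready[candidate]:
            return candidate
    return None
```

Because the search starts just after the current thread and wraps, every ready thread is visited before the current one runs again, which is what makes the policy fair. The dummy scheduler would simply return `current` unconditionally.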
Manual Context Switch
It is also possible to control the context switching manually in the application code, using the
switch instruction. This can be used to model a multithreaded processor which has no hardware
scheduler, thus implementing multithreading with software support.
Context Switch Overhead
It is possible to specify the delay during which the processor will be blocked when performing a context
switch. The default is to perform it with no overhead, i.e. zero cycles. It is also possible to model
a multithreaded processor with a few hardware threads and a larger number of contexts stored in
an SRAM memory. In that case, you can specify the probability of a fast context switch, together
with the fast switch delay, and the slow switch delay used in the remaining cases when loading from SRAM.
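The fast/slow overhead model above amounts to sampling one of two delays; a sketch with assumed parameter names:

```python
import random

def switch_overhead(fast_prob, fast_cycles, slow_cycles, rng=None):
    """Sample the context-switch delay: with probability `fast_prob`
    the target context sits in a hardware register set (fast switch),
    otherwise it must be loaded from SRAM (slow switch)."""
    rng = rng or random.Random(0)
    return fast_cycles if rng.random() < fast_prob else slow_cycles
```

With fast_prob = 1.0 this degenerates to a plain fixed-overhead (or zero-overhead) switch, the simulator's default.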
Connected Resources
Internal Resources
Memories, coprocessors and ports can be internal to a certain processor. This means that only
the processor where the resources are declared can operate on them, and their names do
not have to be globally unique as if the resources were defined in the global namespace. This makes
it easier to increase the performance of a system by increasing the number of processors performing
a task using memories and coprocessors: increasing the number of processors also increases the
number of memories and coprocessors, so that the task's performance scales as well.
Shared Resources
It is possible to connect globally defined resources, such as memories and coprocessors. The
application code can then use the resources through their globally unique names. How the resources
behave when requests are issued from several sources is described under the respective resource
above.
Interrupts
Since it is not uncommon for management tasks in network processors to be implemented using
interrupts, this is supported as well. In some architectures, multithreading is used to speed up
interrupt execution, since the register file in that case does not have to be saved to and restored
from a data memory.
Interrupt Routines
The processor model in the simulator allows an unlimited number of interrupt routines to be
defined. Every routine can either be called at a defined interval, which can be expressed as an
exact number or a uniform interval, or be invoked manually from application code running on
the processor. Support for interrupt priorities is also included, to allow specifying whether an
interrupt may execute when another interrupt is already running. There is an unlimited number
of interrupt priority levels.
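Interrupt arrival at an exact or uniformly distributed interval could be generated as below. This is an illustration, not the simulator's configuration syntax; the exact interval is simply the special case where the bounds coincide:

```python
import random

def interrupt_times(start, end, lo, hi, rng=None):
    """Yield firing cycles for one interrupt routine between `start`
    and `end`. An exact interval is the case lo == hi; otherwise the
    gap between firings is drawn uniformly from [lo, hi]."""
    rng = rng or random.Random(0)
    now = start
    while True:
        now += lo if lo == hi else rng.randint(lo, hi)
        if now >= end:
            return
        yield now
```

Each generated firing would then be dispatched subject to the priority rules above, preempting lower-priority interrupts only.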
Interrupt overhead
The overhead when switching to and from an interrupt routine can be specified. During this time,
the processor is normally blocked while it saves the contents of the registers to a local memory
when entering an interrupt routine, or restores them when the interrupt routine
is left. Using multithreading, this interval can be very short, since no registers have to be saved to
memory, while in other cases, this might be a large number. In the simulator, it is possible to
specify the overhead, and also an individual overhead for each and every
interrupt routine, in case some interrupts use hardware contexts and some are stored in a local
memory.
Instruction cache
Basic support for instruction cache has been implemented to allow modelling of, for example, ARM9
processors, which have instruction cache on-chip. Normally, there is no additional delay when
fetching an instruction, but it is possible to specify an instruction cache miss percentage and the
delay which results from an instruction cache miss. Please read more about cache memories in
“Cache memory limitations” on page 39 for a description of what this can be used for and what
limitations it introduces.
Ports
Ports are an important component of a processor when network processing is discussed. Since
the processor uses ports to communicate with, for example, coprocessors and external memories,
as well as to receive packet data, how a network processor deals with them can have a significant
influence on performance.
Defining read and write operations
As with memories, read and write operations are the two basic operations that you can perform
on ports, even though this can be expanded in the future should it prove a limitation. You can
define how long the port will be blocked for a single read or write operation. You can also
specify whether the processor issuing the request will be blocked completely or if only the current thread
will be blocked. In the latter case, a context switch can occur. It is possible to set up queuing for
ports as well, just as for other resources.
Chapter 5
Simulation 1 - assemc
This is a “proof of concept” benchmark to see how the simulator works and what results it can
produce. The model and system are simple, to reduce the possible error sources and make them easier
to change, but still complex enough to make it difficult to guess what the results will be.
assemc is a real-life application created by Infineon Technologies, used for benchmarking their protocol processor and comparing it to other similar processors. It is fairly small, but
still representative of a typical protocol processing application. Without going into too many details
about the benchmark, a short overview is needed.
The benchmark reads a MAC address and a VLAN tag from a port connected to the protocol
processor. This pair is used to create a key to a CAM, and a lookup will be performed in that
CAM. The processor can either wait until the results are available, or switch to another thread in
the meanwhile if multithreading capabilities are present.
When the results are available, the return value from the CAM will decide what should be
done with the packet. Either the processor issues some commands and finishes there, or the packet
is processed and sent to a FIFO buffer for later processing of the packet.
This benchmark is made to be portable and to compare different architectures, so we will
model it for two different processors: the protocol processor from Infineon, the PP32,
and a common general-purpose RISC CPU from ARM Ltd, the ARM7.
Application modelling
The benchmark source code was available as high-level C code, as compiled assembler for both the PP32
and the ARM, and as hand optimised assembler for the PP32 and the ARM. The benchmark must be modelled
in a way that is architecture independent, and rather fine-grained to make it easier to validate that it is accurate
enough. After the C code and the different assembler versions had been analysed, some frequently
occurring operations were extracted that the target benchmark could be built from. These
were then described in microinstructions (see Appendix A on page 69) for both the PP32 and the ARM.
Architecture modelling
Figure 5.1: Architecture overview for a number of PP32 processors.
The architecture consists of a number of processors (which will vary between the different simulations), all connected to a coprocessor over a shared bus. Each processor has its own internal
memory, so the only shared resource is the coprocessor, and the performance of the entire
system should be limited by this shared resource as we increase the number of processors in the
system. A typical architecture overview can be seen in figure 5.1.
The ARM7 processor is a 32-bit embedded RISC CPU created by ARM Ltd. The processor has
a very small die size and, with its three-stage pipeline, it can run at around 133 MHz in a
0.13 µm process. It has a general-purpose 32-bit instruction set.
The PP32 processor is a 32-bit embedded RISC CPU created by Infineon Technologies, customised
for protocol processing. The instruction set has special instructions suitable for processing e.g. IP
packets with real-time constraints, and the processor can communicate with memories and coprocessors
very efficiently. Since this processor is not publicly available, few details can
be shared.
The CAM is modelled as a pipelined coprocessor with three pipeline stages, as described on
page 40. In other words, the CAM can process three different requests concurrently. The latency,
and thus the throughput, of the CAM operation will be varied in the simulations to find the optimal
number of threads for each given coprocessor latency.
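The relationship between pipelining and sustainable request rate can be sketched with a back-of-the-envelope model (this is not part of the simulator; the function name and units are illustrative):

```python
def max_lookup_rate(latency_ps: float, stages: int = 3) -> float:
    """Lookups per second a pipelined coprocessor can sustain.

    With `stages` pipeline stages and a total lookup latency of
    `latency_ps`, the coprocessor keeps `stages` requests in flight
    and can therefore accept a new request every latency_ps / stages
    picoseconds.
    """
    return stages * 1e12 / latency_ps  # convert per-picosecond to per-second

# A 3-stage CAM with a 120 ns (120000 ps) lookup latency can start a new
# lookup every 40 ns, i.e. 25 million lookups per second.
print(max_lookup_rate(120_000))  # -> 25000000.0
```

The same latency thus limits a single-threaded processor (which waits out the full 120 ns) much more than it limits the coprocessor's aggregate throughput, which is what multithreading exploits.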
The simulation goal
The goal of this simulation is to evaluate which system setup leads to the best area efficiency, as
defined earlier on page 32.
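Written out as a small helper, the area-efficiency metric looks as follows (a sketch of the page 32 definition; the per-context area overhead parameter is an assumption introduced here for illustration, since the real area numbers are confidential):

```python
def area_efficiency(packets_per_s: float,
                    processors: int,
                    base_area_mm2: float,
                    threads: int = 1,
                    context_area_mm2: float = 0.0) -> float:
    """Packets/second/mm^2 for a homogeneous processor cluster.

    Each extra hardware context is assumed to add `context_area_mm2`
    of area per processor (illustrative figure, not a measured one).
    """
    total_area = processors * (base_area_mm2 + (threads - 1) * context_area_mm2)
    return packets_per_s / total_area
```

A multithreaded processor only wins by this metric if its throughput gain outweighs the extra context area, which is exactly the trade-off the following simulations explore.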
The benchmark uses a number of user-changeable parameters that change the environment of
the system. The parameters are:
• The type of processors (either ARM7, or PP32)
• The number of processors
• The area of the processors, using terms defined on page 32.
• The number of threads per processor (only applicable for PP32 processors)
• The coprocessor latency
• The clock frequency for the network processors
The simulation
It is interesting to know how the latency of the CAM coprocessor affects the area efficiency of
the system for different system setups. We will therefore change the latency of the CAM
lookup operation and, for every lookup time, run a number of simulations for different architecture setups:
• Case 1: A system with a varying number of ARMs that all share a common CAM.
• Case 2: A system with one PP32, one CAM and vary the number of threads for the PP32
• Case 3: A system with two PP32s, one shared CAM and vary the number of threads for both
• ...
• Case n: A system with n PP32s, one shared CAM and vary the number of threads for all n
This will be repeated for up to 6 PP32s.
A few graphs will be presented showing the area efficiency (see page 32) as a function of the number
of processors and threads for a system, as described in "The simulation". The numbers on the Y
axis are removed due to their confidential nature.
Please note that it is not performance that we plot, but area efficiency. If a system’s performance scales linearly, its area efficiency will be constant.
[Graph: area efficiency (packets/second/mm²) versus number of threads (PP32) or processors (ARM7); one curve for n × ARM7 and one each for 1–6 × PP32 with n threads.]
Figure 5.2: The assemc benchmark with a CAM latency of 50ns.
For CAM lookup latencies lower than 60 ns, as figure 5.2 shows, the coprocessor will be fast
enough to handle all requests for up to 20 ARM7 processors, or up to 5 PP32s with multithreading.
We can see this by noticing that the area efficiency is the same for any number of processors in the
system. However, multithreading will increase the area efficiency for the PP32 quite drastically.
There is a 33% area efficiency gain for two threads compared to one thread (i.e. a single-threaded
processor) and about 20% area efficiency gain for three threads compared to one thread. For a
large number of threads (5 or more), the processor will no longer increase its efficiency and the
increased area for the contexts will result in less area efficiency than for a single-threaded processor.
When the CAM latency increases, the amount of time a processor has to wait for a request
increases, leading to higher efficiency for a multithreaded processor, since it can do something
useful while waiting. However, a problem arises for a system with a large number of processors
issuing requests: at some point, requests will arrive at the CAM faster than it can process
them, which results in the coprocessor becoming a bottleneck in the system and limiting the
total performance. For large latencies, this will happen already for a small number of processors.
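The saturation point can be estimated with a simple capacity argument (a rough model; the per-processor request rate below is an illustrative assumption, not a number from the simulator):

```python
def saturating_processors(cam_latency_ps: float,
                          requests_per_proc_per_s: float,
                          stages: int = 3) -> int:
    """Largest processor count the CAM can serve without queuing up.

    The pipelined CAM serves stages / latency requests per second;
    beyond capacity / per-processor-rate processors it becomes the
    bottleneck and the system stops scaling linearly.
    """
    capacity = stages * 1e12 / cam_latency_ps  # lookups per second
    return int(capacity // requests_per_proc_per_s)

# Assuming each processor issues roughly 3 million lookups per second
# (illustrative), a 3-stage CAM with 50 ns latency saturates around
# 20 processors, consistent with the plateau seen in figure 5.2.
print(saturating_processors(50_000, 3e6))  # -> 20
```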
[Graph: area efficiency (packets/second/mm²) versus number of threads (PP32) or processors (ARM7), same series as in figure 5.2.]
Figure 5.3: The assemc benchmark with a CAM latency of 80ns.
Figure 5.3 shows how the different systems work with a CAM that has a lookup latency of 80 ns.
For a system with ARM7 processors, the system will not scale linearly when there are more than
15 processors issuing requests to the same CAM. The same happens for a system with 5 PP32
processors, which can also be seen in the graph. The area efficiency gain for PP32 systems
with 4 or fewer processors has increased to 50% from one thread to two threads, due to the
increased latency.
[Graph: area efficiency (packets/second/mm²) versus number of threads (PP32) or processors (ARM7), same series as in figure 5.2.]
Figure 5.4: The assemc benchmark with a CAM latency of 120ns.
For long latencies combined with a multiprocessor system with many processors, the coprocessor
might be too occupied to serve all processors even when they are single-threaded. In these
cases, there is no gain in having multiple threads. This can be seen in the graph for 5 PP32s,
where the area efficiency has decreased for two threads compared to one thread. The gain for a
single PP32 with two threads compared to one thread in this case, where the latency is 120 ns, is
close to 80%.
[Graph: area efficiency (packets/second/mm²) versus number of threads (PP32) or processors (ARM7), same series as in figure 5.2.]
Figure 5.5: The assemc benchmark with a CAM latency of 180ns.
Figure 5.5 shows the case where the latency is 180 ns, which can be considered high. In this
case, for a small number of PP32s, the highest area efficiency is reached at three threads,
compared to the previous examples where two threads were the optimum. The gain from one
thread to three threads for one processor is over 100%, showing how much multithreading really
can improve the efficiency of a system.
CAM latency    Gain, 2 threads    Gain, 3 threads
180 ns              85                 106
200 ns              98                 120

Table 5.1: Gains for single-processor systems, in percent, with n threads compared to one thread
Even though the system was very small and easy to model, the results we obtained from the
simulator were not trivial to predict. For multiprocessor systems, the complications of resource
sharing and similar effects can now be modelled at system level in a short time.
Multithreading results
The area efficiency varies a lot depending on the number of processors, the number of threads and
the coprocessor latency, which were parameters to the simulator. For short coprocessor latencies,
single-processor and multi-processor systems will gain from increasing the number of threads of
multithreaded processors up to two threads. Three, and sometimes four, threads will also
give higher area efficiency than a single-threaded processor, but the maximum is at two threads.
For large latencies, multiprocessor systems will soon produce more requests than the coprocessor
can serve, making the coprocessor the bottleneck of the system. Multithreading for these systems
can lead to lower area efficiency. For single-processor systems, the best area efficiency is reached
with three threads, which means that the optimal number of threads changes with the
coprocessor latency, as expected.
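This latency-dependent optimum is captured by a textbook approximation (it ignores context-switch overhead and resource contention, so it slightly overestimates the useful thread count; the run-length figure is an assumption):

```python
def threads_to_hide_latency(latency_cycles: int, run_cycles: int) -> int:
    """Threads needed to fully hide a blocking latency.

    While one thread waits latency_cycles, the other threads must
    together supply that many cycles of work in run_cycles-sized
    slices: n = 1 + ceil(latency / run_length).
    """
    return 1 + -(-latency_cycles // run_cycles)  # ceiling division

# With an assumed run length of 60 cycles between blocking events:
print(threads_to_hide_latency(60, 60))   # -> 2 (short coprocessor latency)
print(threads_to_hide_latency(120, 60))  # -> 3 (long coprocessor latency)
```

The simulation results follow the same shape: the optimum moves from two to three threads as the CAM latency grows, and adding threads beyond that only costs area.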
Chapter 6
Simulation 2 - NAT
This benchmark is, in contrast to the assemc benchmark on page 45, created using coarser-grained
modelling. Its purpose is to show how a complex system can be modelled at a high level
fairly quickly, while still producing results from which conclusions can be drawn.
The target system is described as a high-level flow chart performing Network Address Translation (NAT),
and the source flow chart can be found in Appendix D on page 90. The system processes data
from layer two up to layer five (application layer) and is using a large amount of coprocessors to
offload the network processing cluster. In addition to the pure network address translation, the
system identifies ARP packets at layer three, ICMP, IGMP and IPSEC packets at layer four and
DHCP, SNMP packets at layer five. These packets are forwarded to a host processor (for example
a general-purpose RISC processor) which will handle the special types of packets (not described
in this benchmark).
The system consists of one WAN port that is connected to the Internet and has a public (valid
and unique) IP address. The system also has a large number of LAN ports that are connected
to client machines, which have private IP addresses in the ranges defined in RFC 1918, e.g. IP
addresses of the form 192.168.x.x.
Network address translation
Short Introduction
NAT, or Network Address Translation, is a short-term solution to the coming shortage of IPv4 addresses
(IPv6 is considered the long-term solution to this problem). For a host to communicate
on the Internet, it must have its own unique identifier, called an IP address. Due to the increasing
number of users connecting to the Internet every day, and due to the fact that the number of IP
addresses available to each ISP, and globally, is limited and fairly small, this
is not always possible. NAT allows a single device, either a computer running a software program or a
dedicated piece of hardware, to connect multiple hosts on a local area network so that all of
them share one, or a small number of, public IP addresses.
How NAT works
There are multiple ways a NAT gateway can perform the translation, for example static NAT
(every host maps to its own unique public address from a large pool of unique addresses), dynamic
NAT (the mapping is done dynamically), NAPT (explained in detail below) and overlapping NAT
(which handles address collisions when both the internal and the external addresses are in the same
subnet). I have decided to use NAPT, or Network Address Port Translation, since it is the most
common way of handling the case where the number of clients is far larger than the number of
available public addresses, and since it has enough complexity to make it interesting to model.
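As a concrete illustration of the state NAPT has to maintain, the core of the translation can be sketched in a few lines (a simplified model: TCP/UDP state tracking, timeouts and port exhaustion are ignored, and all names are illustrative):

```python
class NaptTable:
    """Maps (private IP, private port) pairs to ports on one public IP."""

    def __init__(self, public_ip: str, first_port: int = 49152):
        self.public_ip = public_ip
        self.next_port = first_port
        self.outbound = {}  # (private_ip, private_port) -> public_port
        self.inbound = {}   # public_port -> (private_ip, private_port)

    def translate_out(self, src_ip: str, src_port: int):
        """Rewrite the source of a LAN->WAN packet; create state if new."""
        key = (src_ip, src_port)
        if key not in self.outbound:
            port = self.next_port
            self.next_port += 1
            self.outbound[key] = port
            self.inbound[port] = key
        return self.public_ip, self.outbound[key]

    def translate_in(self, dst_port: int):
        """Look up the destination of a WAN->LAN packet; None if no entry."""
        return self.inbound.get(dst_port)
```

An inbound packet with no matching entry has no translation and is dropped or handed to the host processor, which is why NAPT also acts as a crude firewall.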
[Figure: block diagram with a WAN port, upstream and downstream processing clusters, shared resources (NAT mapping, ARP table), and LAN ports #1 to #n.]
Figure 6.1: Partitioning of the system.
There are two processor clusters taking care of the upstream and the downstream processing respectively, as can be seen in figure 6.1. The processor clusters are multiprocessor systems using multithreaded processors.
Packets from the LAN ports can either be destined for another LAN port, since the device also
works as a layer two (Ethernet) switch, or for the WAN port, which is connected to the Internet. The ratio between these destinations is a parameter that can be changed to see how the
performance, utilisation and other metrics differ with different ratios.
For packets being switched, the packet's Ethernet address is extracted and looked up to find the port where the destination host resides.
For the rest of the packets, which are destined for the WAN port, a thorough examination of
the packet is performed to see that it is not corrupt and whether it should be handled in a special way.
IP packets will be reassembled, if necessary, and if the layer four protocol is TCP or UDP (this
fraction is also a parameter), the packets will be translated. The packets will have their
checksums recalculated for layers 4, 3 and 2, and if a packet must be fragmented, this will be
performed. The packets are then placed in an output queue, and if a packet is considered
important, a high-priority queue can be used.
Packets from the WAN port can only be destined for the LAN ports, or for the NAT device itself.
The flow is therefore more straightforward in this case: the packets are checked for corruption
(incorrect Ethernet CRC) and their layer three and layer four protocols are identified. IP packets will be
reassembled, and other types, e.g. ARP, will be handled by the host CPU. This also applies to
other protocols in other layers, for example IPSEC, DHCP and SNMP. TCP and UDP packets
will be translated if there is an entry in the classification table for them. The packets will then be
fragmented, if needed, and put on the egress queues of the LAN ports where the destination hosts
reside.
A complete flow graph can be found in Appendix D on page 90.
Cost Functions
The cost functions, or action blocks, which can be seen in the flow chart on page 90, must be
modelled as procedures performing calculations, memory accesses, coprocessor operations and
similar. After describing at a high level what has to be done in every action, it should be fairly
straightforward to estimate what it would look like at a lower level. In this case, there are
many already-written algorithms available that can be used to estimate what an action would
look like in compiled form. In other situations this is not possible, and since there is no implementation
of network address translation for our system setup, most of the actions have been estimated.
The goal
The goal of this benchmark is to improve total performance and to find and eliminate all bottlenecks
that might not be apparent at the start of the modelling. We will count the number of packets
processed, which is the sum of all packets processed by the upstream and downstream parts.
Some packets will be discarded, but they are still handled by the system and are also counted
in this number. In addition, we want to:
• Benchmark coprocessor utilisation to find bottlenecks
• Benchmark memory utilisation to find bottlenecks
• Get the latency for packets, which could be high for multithreaded systems
The benchmark is highly parameterised and it is easy to change the system environment to test
different scenarios. Some things that can easily be changed are:
• The number of network processors for the upstream processing
• The number of network processors for the downstream processing
• The number of threads for every network processor
• The clock frequency for the network processors
• All latencies, pipelines and more for the coprocessors
• If the packet processors should have an internal private memory for header manipulation or
use the global memory
• If there should be one global memory for the entire system or separate for upstream and
downstream processing
• The load of the system, i.e. the number of arriving packets
• The number of packets that are defective (wrong CRC sum)
• The percent of IP packets that are fragmented
• The percentage of packets on the upstream side that are destined for the local area network
• The percentage of packets which are TCP or UDP packets, and thus will be translated
After changing any parameter, the benchmark is ”compiled” and can then be simulated.
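For illustration, such a parameter set could be collected in a single structure before "compiling" the benchmark (the key names and values below are assumptions, not the simulator's actual configuration syntax):

```python
nat_benchmark_params = {
    "upstream_processors": 4,
    "downstream_processors": 2,
    "threads_per_processor": 2,
    "clock_mhz": 250,
    "coprocessor_latencies_ns": {"crc32": 80, "cam_lookup": 120},
    "private_header_memory": True,      # header fields in per-processor memory
    "separate_packet_memories": False,  # one memory vs. upstream/downstream split
    "offered_load_packets_per_s": 1_000_000,
    "defective_crc_fraction": 0.01,
    "fragmented_ip_fraction": 0.05,
    "lan_destined_fraction": 0.30,
    "tcp_udp_fraction": 0.90,
}

# Basic sanity check on the traffic-mix fractions before simulating:
assert all(0.0 <= nat_benchmark_params[k] <= 1.0
           for k in ("defective_crc_fraction", "fragmented_ip_fraction",
                     "lan_destined_fraction", "tcp_udp_fraction"))
```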
Possible settings and questions
• Only a global memory, to which both coprocessors and network processors are connected.
Is the memory fast enough or will it become a bottleneck?
• How much could be gained if a small private memory were used by the protocol processors
for header processing applications (HPA), while keeping the packet data in a global memory
for packet processing coprocessors, for example CRC?
• To reduce the memory overhead, two memories could be used - one for the upstream processing
and one for the downstream processing. This requires that the coprocessors can access
both global memories, while each network processor is strongly connected to one of them. How
much would be gained?
• A more realistic optimisation would be to perform an inline computation of the CRC on
the packet data as it is being received/transmitted, thus removing the need to compute the
CRC from local memory. While this is more of an ASIC solution, is it required to reach high
performance?
• How much are the coprocessors used? Could they be replaced with cheaper but slower
coprocessors and still keep the performance?
• At how many processors/threads will the system reach its optimum area efficiency?
• How will multithreading and the large amount of coprocessors affect the latency for processing
a certain packet?
Packet lifetime
Even though the simulator is stateless and does no real processing of packets (it does not modify any
data), you can investigate a packet's way through the system. This can be useful for measuring the
throughput and the latency of packets, and where they end up in the system. The last measurement can
also be used for troubleshooting the system and making sure that the flow is correct. The latency
of a certain packet can vary a lot due to the path the packet takes, the amount of multithreading,
the number of coprocessors and the scheduling algorithm, to name a few examples. This means that
we need to simulate a large number of packets to see how the system behaves in general, while small
variations will still be visible.
Simulation results
The initial simulations show that the performance will be limited by the shared external memory
that all processors and coprocessors operate on. This was not surprising, but how much it will
limit the performance and at what load can now be determined using the simulator.
One architectural change that could decrease the memory load is to let all packet processors
have the header fields in a small private memory so that the packet processors can operate on all
header fields for all layers without having to access the shared external memory. A graph for the
difference can be seen in figure 6.2. In this graph, the downstream processing part is kept constant
while varying the number of upstream processors and threads.
[Graph: memory utilisation for the two configurations "CRC, one external memory" and "CRC, one external memory plus internal".]
Figure 6.2: Memory utilisation when using a commandable CRC coprocessor
The difference was small - only a few percent - which can be explained by the fact that the real
memory-accessing elements are the checksum coprocessors. The CRC coprocessor, for example,
has to read every single byte of every packet to compute a CRC. This is performed for both
the ingress and egress parts of the packet processing. In our initial solution, the CRC operates
on the memory because we wanted an architecture that could be changed substantially from its initial
requirement specification. This could, for example, allow layer two protocols other than the
standard Ethernet protocol that we now support.
A solution to this is to let the CRC operate on the packet as it is being received and transmitted
by the network processor. This is more of an ASIC solution, but due to the performance
requirements, it is a trade-off that we must accept.
Changing this in the simulator was easy - just change how the CRC cost function operates.
Graphs for the new memory utilization for this case can be seen in figure 6.3.
[Graph: memory utilisation for the two configurations "No CRC, one external memory" and "No CRC, one external memory plus internal".]
Figure 6.3: Memory utilisation for inline computation of CRC
[Graph: performance (packets produced) for a CRC operating on the packet stream versus a CRC operating on memory.]
Figure 6.4: Performance for a system with one packet memory
As you can see in figure 6.4, the performance increases a lot for the systems with an inline CRC.
This is most noticeable on multiprocessor systems.
To further decrease the memory utilisation, different external memories can be used for
the upstream and downstream packet processing. The performance for these two systems can be
seen in figure 6.5 (note: here, the number of both upstream and downstream processors has been
varied, unlike in the previous simulations where the number of downstream processors was kept
to a minimum, so the total number of processors is twice that noted in figure 6.5). For
a large number of processors, the difference is large. The performance of the whole system
saturates at about 10 processors.
[Graph: performance (packets produced) for a system with two packet memories versus one memory.]
Figure 6.5: Performance for a system with two packet memories
Another interesting graph shows how the latency varies with the number of threads in the processor
when having one packet per thread. The theory is that multithreading will increase the latency,
since a new packet will be processed whenever a long-latency event occurs. With the round-robin
scheduler, the packet that was previously being processed is put last in the queue and will not be
processed again until all other packets (i.e. threads) in that processor have been interrupted by a
latency source. The results of the simulations can be seen in figure 6.6. A larger number of processors
also increases congestion on the shared resources, which leads to even higher latencies.
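That theory can be made concrete with a small back-of-the-envelope model (idealised: all threads are assumed to behave identically, and all figures are illustrative rather than simulator output):

```python
def packet_latency_cycles(n_threads: int,
                          run_cycles: int,
                          block_cycles: int,
                          blocks_per_packet: int) -> int:
    """Rough per-packet latency under round-robin switch-on-block.

    Each time a packet's thread blocks, it waits out its own stall
    and, with round robin, also waits for every other thread to run
    one run_cycles slice before being scheduled again.
    """
    wait_per_block = max(block_cycles, (n_threads - 1) * run_cycles)
    return blocks_per_packet * (run_cycles + wait_per_block)

# Three blocking events per packet, 60-cycle run lengths, 120-cycle stalls:
print(packet_latency_cycles(1, 60, 120, 3))  # -> 540
print(packet_latency_cycles(4, 60, 120, 3))  # -> 720 (latency grows with threads)
```

Once (n_threads - 1) × run_cycles exceeds the stall length, every added thread adds directly to the per-packet latency even though throughput stays high, which is the effect figure 6.6 shows.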
[Graph: packet latency versus number of threads; one curve each for 1–5 × PP.]
Figure 6.6: Packet latency for different number of threads
Chapter 7
Conclusions
The question of whether multithreading is a suitable processor technique for improving the performance
of network processors has been examined thoroughly, and hopefully some decisions can be taken
early in the development process using the information in this thesis together with the proposed
simulator (EASY).
As the simulations in chapter 5 showed, multithreading can be justified and lead to large
improvements compared to single-threaded systems. A total gain of 120% in area efficiency was
achieved with three threads when communicating with a coprocessor with 200 ns latency. However,
for large numbers of threads, the performance will decrease due to the extra area required.
Simulations by other sources [5] have shown similar results. Still, in many modern network
processors, the number of threads of multithreaded processors increases with every generation of the
product, which could be due to marketing decisions. The trade-offs that have to be taken into
consideration are the increased complexity of the processor, which requires more verification time,
a more difficult programming model, and worse cache memory performance. However, due to
the nature of network processors, this can in most cases be justified considering the performance
gains.
The proposed simulator (EASY) can be used to produce interesting information early in the
design space exploration, and its focus on multithreading for network processors early
in the system-level methodology makes it unique to the author's knowledge. Comparing different
architectures, such as multiprocessor systems versus multithreaded systems, and analysing large
applications is practically impossible to do analytically, and this is where the simulator comes to use.
Chapter 8
Future Work
Even though the simulator has proven useful for extracting results relating to multithreading
and system-level modelling, there are some issues that have not been dealt with. Addressing them
could provide more accurate results, or make it possible to simulate other kinds of systems at all.
These are some suggestions, with comments, that could be implemented to improve the simulator.
Suggested features
Bus modelling
When resources share a common bus, or when the bus has a limited speed or latency, this will affect
the system's operation. This can be extended to cover advanced interconnection networks such as
crossbar networks, n-dimensional meshes and more. In addition to the modelling part, it should be
possible to obtain results on bus utilisation and similar, which could be used to improve the architecture.
Data dependencies
The simulator does not currently take the effects of data dependencies into consideration. To properly
model a system with data caches and similar, the ability to describe data dependencies
is essential. The complexity of implementing this feature will probably be high, and describing
data dependencies in a high-level modelling language might be difficult.
Lower level modelling
When a low-level system description is already available, for example in SystemC or assembler
format, it should be possible to automatically extract the parts necessary for the simulation with high
accuracy. This includes for example application modelling, which could easily be translated from
assembler source code, and program flow statistics, which can be extracted from a program
execution trace. This would, however, make it harder to alter the system afterwards, since
the trace would no longer be correct, but that is the price for an accurate simulation,
which is supported by the design pyramid on page 30.
[1] Bob Boothe and Abhiram Ranade. Improved multithreading techniques for hiding communication latency in multiprocessors. In ISCA ’92: Proceedings of the 19th annual international
symposium on Computer architecture, pages 214–223, New York, NY, USA, 1992. ACM Press.
[2] Patrick Crowley, Marc E. Fiuczynski, Jean-Loup Baer, and Brian N. Bershad. Characterizing
processor architectures for programmable network interfaces. In ICS ’00: Proceedings of the
14th international conference on Supercomputing, pages 54–65, New York, NY, USA, 2000.
ACM Press.
[3] Richard J. Eickemeyer, Ross E. Johnson, Steven R. Kunkel, Mark S. Squillante, and Shiafun
Liu. Evaluation of multithreaded uniprocessors for commercial application environments. In
ISCA ’96: Proceedings of the 23rd annual international symposium on Computer architecture,
pages 203–212, New York, NY, USA, 1996. ACM Press.
[4] Alexandre Farcy and Olivier Temam. Improving single-process performance with multithreaded processors. In ICS ’96: Proceedings of the 10th international conference on Supercomputing, pages 350–357, New York, NY, USA, 1996. ACM Press.
[5] Mark A. Franklin and Tilman Wolf. A network processor performance and design model
with benchmark parameterization. In Proc. of Network Processor Workshop in conjunction
with Eighth International Symposium on High Performance Computer Architecture (HPCA8), pages 63–74, Cambridge, MA, February 2002.
[6] Stephen R. Goldschmidt and John L. Hennessy. The accuracy of trace-driven simulations of
multiprocessors. SIGMETRICS Perform. Eval. Rev., 21(1):146–157, 1993.
[7] Matthias Gries. Methods for evaluating and covering the design space during early design
development. Technical Report UCB/ERL M03/32, Electronics Research Lab, University of
California at Berkeley, August 2003.
[8] Paul Lieverse, Todor Stefanov, Pieter van der Wolf, and Ed Deprettere. System level design
with SPADE: an M-JPEG case study. In ICCAD ’01: Proceedings of the 2001 IEEE/ACM
international conference on Computer-aided design, pages 31–38, Piscataway, NJ, USA, 2001.
IEEE Press.
[9] Jean-Loup Baer and Patrick Crowley. A modeling framework for network processor systems.
[10] Robert Morris, Eddie Kohler, John Jannotti, and M. Frans Kaashoek. The click modular
router. In Symposium on Operating Systems Principles, pages 217–231, 1999.
[11] Xiaoning Nie, Lajos Gazsi, Frank Engel, and Gerhard Fettweis. A new network processor
architecture for high-speed communications. In IEEE Workshop on Signal Processing Systems
(SiPS’99), 1999.
[12] Niraj Shah, William Plishker, and Kurt Keutzer. NP-Click: A programming model for the Intel
IXP1200. In 2nd Workshop on Network Processors (NP-2) at the 9th International Symposium
on High Performance Computer Architecture (HPCA-9), Anaheim, CA, February 2003.
[13] Pierre Paulin, Chuck Pilkington, and Essaid Bensoudane. StepNP: A system-level exploration
platform for network processors. IEEE Des. Test, 19(6):17–26, 2002.
[14] M. Peyravian, G. Davis, and J. Calvignac. Search engine implications for network processor
efficiency. 2003.
[15] Radhika Thekkath and Susan J. Eggers. The effectiveness of multiple hardware contexts. In
ASPLOS-VI: Proceedings of the sixth international conference on Architectural support for
programming languages and operating systems, pages 328–337, New York, NY, USA, 1994.
ACM Press.
[16] Dean M. Tullsen, Susan Eggers, and Henry M. Levy. Simultaneous multithreading: Maximizing on-chip parallelism. In Proceedings of the 22nd Annual International Symposium on
Computer Architecture, pages ??–??, 1995.
[17] Theo Ungerer, Borut Robič, and Jurij Šilc. A survey of processors with explicit
multithreading. ACM Comput. Surv., 35(1):29–63, 2003.
[18] Vladimir D. Zivkovic and Paul Lieverse. An overview of methodologies and tools in the field
of system-level design. In Embedded Processor Design Challenges: Systems, Architectures,
Modeling, and Simulation - SAMOS, pages 74–88, London, UK, 2002. Springer-Verlag.
Appendix A
Modeling Language
Executing commands
Simple Instruction, si
Syntax: si
Example: si
The simple instruction is used for modelling instructions that do not communicate with
other peripherals or block the thread or processor. A common example is an "add" instruction.
Block Instruction, block
Syntax: block [<what> <duration>]
Example: block processor 3 40% 7 60%
block thread 20-40
The block instruction blocks the processor or the thread for a given duration. If the thread is
blocked, a context switch might occur. For a complete list of the syntax for the what and duration
parameters, please read about penalties in the Architecture File Documentation. If you do not
specify the what and duration parameters, the processor will be blocked for one cycle.
Memory Instruction, memory
Syntax: memory <name> <operation>
Example: memory sdram read
The memory instruction will perform a read or write operation on the memory name attached
to the processor. This might be an internal or an external memory, and the operation might be
cached or not. Valid values for the operation are currently "read" and "write". The processor might
be blocked after issuing this instruction, and a context switch might occur if the read operation
defined in the architecture file is set to only block the thread.
Coprocessor instruction, coproc
Syntax: coproc <name> <operation>
Example: coproc crc32 calculate
The coprocessor instruction will perform an operation on the coprocessor name attached to
the processor. This might be an internal or an external coprocessor. The operation value has to
be defined in the architecture file prior to calling this instruction. The processor might be blocked
after issuing this instruction and a context switch might occur. This depends on the definition of
the coprocessor in the architecture file and the scheduling algorithm used.
Statistics instructions
Marker instruction, marker
Syntax: marker <name>
Example: marker packet_start
The marker instruction increments the marker with name name in the currently running thread.
This can be used for benchmarking algorithms and architectures.
Program flow instructions
Syntax: <label name>:
Example: beginning:
This is not really an instruction, since it does not add itself to the instruction queue.
Instead, it adds the label and the corresponding (calculated) address to the jump table
so that the jump instruction can be used to jump to this destination later. The label name can be
any string and must not contain whitespace. The name is followed by a colon.
Jump instruction, jump
Syntax: jump <location> [<probability>]
Example: jump skip 40%
The jump instruction jumps to the label named location if the jump is probable enough. If no
probability is provided, the jump will always be taken. The processor might be blocked for a
number of cycles depending on the penalty associated with a forward jump, a backward jump or a
conditional jump that was not taken. This can be used to model static branch prediction.
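The interaction between the jump probability and the direction-dependent penalties can be sketched in Python. This is an illustrative model only, not simulator code; the penalty values match the pp32 example in Appendix B (no_jump 1, backward 3, forward 3 cycles).

```python
import random

def take_jump(probability_pct=100, rng=random):
    """Decide whether a `jump <location> [<probability>]` is taken.
    With no probability given, the jump is always taken (100%)."""
    return rng.random() * 100 < probability_pct

def jump_penalty(taken, target_addr, current_addr, penalties):
    """Pick the penalty class, as used to model static branch prediction:
    a not-taken jump, a backward jump and a forward jump can each have a
    different cost."""
    if not taken:
        return penalties["no_jump"]
    if target_addr < current_addr:
        return penalties["backward"]
    return penalties["forward"]

# Penalties in cycles, as in the pp32 example in Appendix B.
penalties = {"no_jump": 1, "backward": 3, "forward": 3}
print(jump_penalty(True, 10, 50, penalties))   # -> 3 (backward jump)
print(jump_penalty(False, 10, 50, penalties))  # -> 1 (jump not taken)
```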
Loop instruction, loop / loop end
Syntax: loop <duration> ... loop end
Example: loop 4-6 ... loop end
The segment of code between "loop <duration>" and "loop end" will be run a number of times
depending on duration. Valid values for duration are a fixed number, e.g. "5", a range, e.g.
"4-6", or the keyword "forever", which indicates an infinite loop. Note that you are not allowed
to jump (with the "jump" instruction) into a loop if you are not already in it; the simulator will
complain about it and exit. It is, however, allowed to jump within a loop or out of a loop.
Interrupt instructions
Interrupt instruction, interrupt
Syntax: interrupt exec <routine>
Example: interrupt exec cleanup
Syntax: interrupt end
Example: interrupt end
The interrupt instruction has two purposes. The first is to call an interrupt routine using
"interrupt exec routine", which will execute the interrupt and jump to the label named routine.
The processor might be blocked for some cycles before it reaches that address, depending on the
penalty for issuing an interrupt. This penalty is entered in the architecture file as interrupt::exec.
The second use is to end an interrupt routine. The processor will then be blocked for a number
of cycles depending on the penalty for leaving an interrupt routine, and then the program will
return to the address where it was executing prior to the interrupt. The penalty for executing this
instruction is interrupt::end.
Switch instructions
Switch instruction, switch
Syntax: switch <thread_name>
Example: switch reader2
This instruction performs a context switch to another thread on the next cycle. Note that the
scheduler might override your request so that either no context switch occurs or a context switch
to a different thread happens; this is very probable if you switch to a thread that is blocked.
The standard fair round-robin scheduler will allow you to change to any thread that isn't blocked,
but you might want to write your own scheduler to allow a deterministic switch event.
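The default scheduler's behaviour can be illustrated with a small sketch. This is a hypothetical Python model (function and variable names are made up, not simulator identifiers): a fair round-robin scheduler walks the thread list in circular order and picks the next thread that is not blocked, which is why a requested switch to a blocked thread gets overridden.

```python
def round_robin_next(threads, blocked, current):
    """Pick the next non-blocked thread after `current`, in circular order.

    threads: list of thread names in scheduling order
    blocked: set of thread names that are currently blocked
    current: name of the currently running thread
    Returns the chosen thread name, or None if every thread is blocked.
    """
    start = threads.index(current)
    n = len(threads)
    for offset in range(1, n + 1):
        candidate = threads[(start + offset) % n]
        if candidate not in blocked:
            return candidate
    return None  # all threads blocked: the processor stalls

# A requested switch to the blocked thread "reader2" is overridden and
# the scheduler moves on to "control" instead.
threads = ["reader1", "reader2", "control"]
print(round_robin_next(threads, {"reader2"}, "reader1"))  # -> control
```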
Semaphore instructions
These instructions are useful for modelling communication, mutual exclusion, synchronisation and
other control- and data-dependent issues.
Wait instruction, wait or consume
Syntax: wait <token>
Alternative syntax: consume <token>
Example: consume ingress_packet
The wait instruction will decrement the semaphore counter token, and if the result is negative,
the thread will be blocked until the counter reaches zero or above. This happens in the same
cycle and cannot be interrupted. If the semaphore is non-blocking, a thread switch will probably
occur, but since there is no way of knowing when the semaphore will be freed, a context switch
might select a thread that is already blocked on a semaphore wait. This can lead to a bouncing
behaviour in the context switching if the scheduling algorithm isn't aware of it; the default fair
round-robin scheduler copes with this.
Signal instructions, signal or produce
Syntax: signal <token>
Alternative Syntax: produce <token>
Example: produce egress_packet
The signal instruction will increment the semaphore counter token, and if a thread is blocked
waiting for the token, it might be released in the same cycle or in the following cycle. Whether
this nondeterministic behaviour is a problem is uncertain, and it may be subject to change.
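The wait/signal semantics above are those of an ordinary counting semaphore. A minimal Python sketch (illustrative only; the class and method names are invented for this example) shows why a negative counter means there are blocked waiters:

```python
class SimSemaphore:
    """Counting semaphore with the wait/signal semantics of the
    microinstructions: wait decrements, signal increments."""

    def __init__(self, init_value=0):
        self.counter = init_value
        self.waiters = []  # threads blocked on this semaphore, FIFO

    def wait(self, thread):
        """Decrement the counter; if the result is negative, the
        calling thread blocks until a signal releases it."""
        self.counter -= 1
        if self.counter < 0:
            self.waiters.append(thread)
            return "blocked"
        return "running"

    def signal(self):
        """Increment the counter; if a thread was blocked waiting for
        the token, release it (FIFO order in this sketch)."""
        self.counter += 1
        if self.waiters:
            return self.waiters.pop(0)  # name of the released thread
        return None

# A mutex is a semaphore initialised to 1, as in the architecture file
# example "mutex_one blocking 1".
mutex = SimSemaphore(init_value=1)
print(mutex.wait("t1"))  # -> running (counter is now 0)
print(mutex.wait("t2"))  # -> blocked (counter is now -1)
print(mutex.signal())    # -> t2     (the blocked thread is released)
```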
Message passing instructions
These instructions are used to create statefulness in the benchmark and to send information
from one processor to another, or from one thread to another. This can be useful for modelling
advanced coprocessors.
Condition instruction, condition
Syntax: condition set <name> <value>
Syntax: condition test <name> <value> <label>
Example: condition test direction upstream process_upstream
This instruction can either set a condition to a certain value, or test whether a condition matches
a given value. If it does, the program will continue at the label specified in the test condition.
This instruction takes zero clock cycles to execute. Note that all conditions are system-global, so
two processors can use these instructions to exchange information.
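Since conditions are system-global and cost zero cycles, they behave like a shared key-value table. The following is a hypothetical Python sketch of the set/test semantics (the table and function names are invented; `pc` stands in for falling through to the next instruction):

```python
# Global condition table shared by all processors and threads
# (an illustrative model, not the simulator's data structure).
conditions = {}

def condition_set(name, value):
    """condition set <name> <value>: store a system-global condition."""
    conditions[name] = value

def condition_test(name, value, label, pc):
    """condition test <name> <value> <label>: continue at `label` if the
    condition matches, otherwise fall through to `pc`."""
    if conditions.get(name) == value:
        return label
    return pc

# One processor sets the direction; another tests it and branches.
condition_set("direction", "upstream")
print(condition_test("direction", "upstream", "process_upstream", "next"))
# -> process_upstream
```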
Appendix B
Architecture File
The architecture file is the place where you describe the memories, coprocessors and processors in
the system. This file together with the target code file(s) are used by the simulator.
The file is a normal text file with lines terminated by UNIX or Windows line endings. Spaces and
tabs can be used for indentation, and you can use comments with the same syntax as single-line
C comments, i.e. they start with "//".
In the root level you can declare:
<semaphore declarations>
<public memory declarations>
<public coprocessor declarations>
<processor declarations>
Semaphores declarations
The syntax for the semaphore declaration is described as:
semaphores {
    <name> <should_block> <init_value>
    <name> <should_block> <init_value>
}
This section is optional; semaphores only have to be declared if they should have an initialisation
value different from zero (0) or if they should block the processor.
semaphores {
    mutex_one blocking 1
    mutex_two non_blocking 2
}
Memory declarations
The syntax for a memory declaration is described as:
memory <name> {
    operation read <penalty>
    operation write <penalty>
    cached read <percent> <penalty>
    cached write <percent> <penalty>
    <dump statement>
    <queue statements>
}
All fields within the memory block are optional, but you should declare the read and write
operations.
memory sdram {
    operation read block thread 10
    operation write block thread 20-25
    cached read 90% block processor 3
    queue read
    queue write
}
Coprocessor declarations
The syntax for the coprocessor declaration is:
coprocessor <name> {
    operation <name> <penalty>
    <dump statement>
    <queue statements>
}
All fields are optional, but you should declare at least one operation. You can have as many
operations as you want as long as their names are unique.
coprocessor hashtable {
    operation lookup block thread 10-12
    operation insert block thread 40
    operation delete block thread 30-40
    queue lookup
    queue insert, delete
    dump hashtable_dump.txt
}
Advanced coprocessors
The way to model advanced coprocessors using normal processors is to use semaphores to ensure
proper synchronisation, with a sharing policy that prevents two processors from working with the
same coprocessor at the same time. The host processor issues a coprocessor action using these
microinstructions (please note that each semaphore instruction executes in one clock cycle, which
will cause some overhead):
wait may_talk_with_<coprocessor>
signal start_<coprocessor>
wait <coprocessor>_completed // The coprocessor will work here, and this processor will block
signal may_talk_with_<coprocessor>
The outer wait/signal pair is to ensure that only one processor can execute an action on the
coprocessor. The second line is to start the coprocessor and on the third line, the host processor
will be blocked until the coprocessor has finished its execution. During this time, the processor
might switch thread if the semaphores are defined in that way and if the processor supports
multithreading. The corresponding code for synchronising the coprocessor looks like:
loop forever
wait start_<coprocessor>
// Do the coprocessor action here
signal <coprocessor>_completed
loop end
To make it possible for a coprocessor to have different actions, or to issue different operations
depending on who called it, conditions can be used for communicating with the coprocessor. Please
note that conditions are executed without delay, i.e. in zero clock cycles.
condition set <coprocessor>_mode <action>
signal start_<coprocessor>
signal may_talk_with_<coprocessor>
The corresponding code for testing the condition will be:
loop forever
wait start_<coprocessor>
condition test <coprocessor>_mode <action_one> <label_one>
condition test <coprocessor>_mode <action_two> <label_two>
// Do the coprocessor action here
signal <coprocessor>_completed
jump beginning
// Do the coprocessor action here
signal <coprocessor>_completed
loop end
An advanced coprocessor can only execute one of its actions at a time, just as traditional
coprocessors behave. You could also use several coprocessors, let each of them handle one action,
and use semaphores to make sure that only one of them can be used at a time. That solution is
harder to describe, but it makes it easier to change the mapping of which actions are available on
which coprocessor, which is more consistent with the Y-chart model.
Processor declarations
Main syntax
The syntax for the processor declaration is:
processor <name> {
    <dump statement>
    markers <filename>
    <instruction cache statement>
    <interrupts block>
    <instructions block>
    <threads block>
    use memory <name>
    use coprocessor <name>
    <internal, private, memory declarations>
    <internal, private, coprocessor declarations>
}
You must specify the interrupts, instructions and threads blocks, but the rest are optional, even
though you should connect at least one memory or coprocessor or define internal ones. You can
use any externally declared memory or coprocessor with the "use" statement, and these can be
connected to as many processors as you want. Internal memories and coprocessors are only
accessible by the processor in which they are declared, and their names only have to be unique
within that processor, so two processors can each have their own private memory called "sram".
Interrupts block
The interrupt block is declared as:
interrupts {
    interrupt <name> {
        frequency <frequency>
        enter <enter-penalty>
        leave <leave-penalty>
    }
}
You can declare as many interrupts as you want, and you do not have to specify any interrupt
at all if you do not wish to. The <frequency> can either be a fixed number, e.g. "5", or a range,
e.g. "10-49". In the first case the interrupt will occur every 5 cycles, and in the second case it
will occur every 10 to 49 cycles. The <enter-penalty> and <leave-penalty> are in the standard
penalty format. The interrupt will start at a label called <name>, and this is the only required
parameter; frequency, enter and leave are all optional.
Instructions block
The instructions block is defined as:
instructions {
    <instruction> {
        action1 <penalty>
        action2 <penalty>
    }
    <instruction> {
        action <penalty>
    }
}
The list of instructions and their respective actions can be found in the manual for the
microinstructions. You should define all microinstructions and their actions here to be sure that
the processor behaves properly.
Threads block
The ”threads” block is defined as:
threads {
    switch <penalty>
    thread <name> {
        code <filename>
    }
    thread <name> {
        code <filename>
    }
}
You must specify at least one thread, and the context switch penalty if you have more than one
thread. The thread names do not have to be unique, but they should be.
Ports block
Ports are defined as:
port <name> {
    <queue statements>
    operation <name> <penalty>
}
All fields within the port block are optional, but you should declare the read and write operations.
Instruction cache statement
To model the processor running code from an external memory using a cache memory in between,
this statement should be used.
Syntax: instruction cache <penalty>
Example: instruction cache block processor 0 10% 3 90%
This statement is optional.
Processor example
processor pp32 {
    dump pp32_dump.txt
    markers pp32_markers.txt
    interrupts {
        interrupt check_buffers 140
    }
    instructions {
        interrupt {
            exec block processor 2
            end block processor 1
        }
        jump {
            no_jump block processor 1
            backward block processor 3
            forward block processor 3
        }
        loop {
            enter block processor 0
            loop block processor 0
            exit block processor 0
        }
    }
    threads {
        switch block processor 0-1
        thread 1 {
            code reassembly.code
        }
        thread 2 {
            code reassembly.code
        }
        thread 3 {
            code control.code
        }
    }
    use memory sdram
    use coprocessor fast_buffer
    memory sram {
        operation read block processor 1
        operation write block processor 1
    }
    port one {
        operation read block processor 1
        operation write block processor 1
    }
}
Dump statement
The syntax of the dump statement is:
dump <filename>
For every cycle, the unit (coprocessor, memory, processor etc) will dump its internal state to
the file. The file consists of many lines containing TSV (tab separated values). The first line of
the file will contain a header which describes the meaning of the rest of the lines. Since the unit
might dump many lines for every clock cycle, this file can be fairly large after simulating many
clock cycles.
dump memory_sdram_dump.txt
Queue statement
The syntax of the queue statement is:
queue {
    operation <operations>
    max_parallelism <number>
    rtr_time <number>
}
For units that support the queue statement, requests are queued in the queue whose operation
list contains the requested operation. The rtr_time decides how much time must elapse between
two requests, and max_parallelism decides how many requests can be processed simultaneously.
These parameters can be used to model a pipelined coprocessor: for three pipeline stages, set
max_parallelism = 3 and rtr_time = processing time / 3. You can have many queues, with one or
more operations per queue.
queue {
    operation lookup
    max_parallelism 3
    rtr_time 30
}
queue {
    operation read
}
queue {
    operation write, erase
}
The last two queue statements will create two queues: all read operations will be queued in one
queue, and all write and erase operations will be queued in the other. This is particularly useful
for coprocessors. There is no limit on the queue length, and you might have to investigate the
dumps (see the dump statement) to see when a queue holds too many operations.
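The pipelining rule above (max_parallelism = number of stages, rtr_time = processing time divided by the number of stages) can be checked with a small sketch. This is a hypothetical Python model of a single queue, not the simulator's implementation, and the function name is invented:

```python
def finish_times(arrivals, processing_time, max_parallelism, rtr_time):
    """Compute completion times for requests sent to a queued unit.

    A request may start when (a) at least rtr_time cycles have passed
    since the previous request started, and (b) fewer than
    max_parallelism requests are still in flight.
    """
    starts, finishes = [], []
    for t in arrivals:
        start = t
        if starts:
            start = max(start, starts[-1] + rtr_time)   # rate limit
        in_flight = [f for f in finishes if f > start]
        while len(in_flight) >= max_parallelism:        # width limit
            start = min(in_flight)                      # wait for a slot
            in_flight = [f for f in finishes if f > start]
        starts.append(start)
        finishes.append(start + processing_time)
    return finishes

# A three-stage pipelined coprocessor: each operation takes 30 cycles,
# a new request is accepted every 10 cycles, and up to three requests
# are in flight at once. Four back-to-back requests complete at a rate
# of one per rtr_time once the pipeline is full.
print(finish_times([0, 0, 0, 0], 30, 3, 10))  # -> [30, 40, 50, 60]
```

With max_parallelism 1 and rtr_time 0 the same model degenerates to a plain serial unit, which is the behaviour of a unit without a queue statement beyond request buffering.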
Penalty statement
The syntax of the penalty statement is:
block <what> <duration>
The <what> parameter can be either 'thread' or 'processor'. If it is 'thread', only the currently
running thread will be blocked, and if the architecture and scheduling algorithm permit, a context
switch might occur to another thread that is marked 'ready'. If it is 'processor', the entire
processor will be blocked and no context switch can occur.
There are three variants of the <duration> parameter:
• A fixed number, e.g. '5', which will block the thread or processor for a fixed number of cycles.
• A range, e.g. '5-10', which will block the thread or processor for a number of cycles between 5
and 10. The probability is evenly distributed over these values, and the end values, i.e. '5'
and '10', are also candidates.
• A weighted list. The list consists of pairs of tokens, where the first token is a number of
cycles and the second is its probability. You can enter as many pairs as you want, and
the last probability field may be the string 'rest' to specify the remaining percentage needed
to sum up to 100%.
block processor 1 30% 2 10% 3 60%
block processor 1 30% 2 10% 3 rest
These two examples are identical.
You will get an error message if the percentages don't sum up to 100%, or if you use 'rest' when
the other percentages already sum to more than 100%, which would force a negative value for
'rest'.
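The three duration variants can be modelled by a small sampling routine. The sketch below is hypothetical Python, not simulator code: it parses a weighted list such as `1 30% 2 10% 3 rest` into (cycles, probability) pairs, enforces the 100% rule described above, and draws a duration from the weights.

```python
import random

def parse_weighted(tokens):
    """Parse pairs like ['1', '30%', '2', '10%', '3', 'rest'] into
    (cycles, percent) pairs; 'rest' absorbs the remaining percentage.
    Raises ValueError if the percentages do not sum to exactly 100%."""
    pairs, used = [], 0
    for cycles, prob in zip(tokens[0::2], tokens[1::2]):
        if prob == "rest":
            pct = 100 - used  # negative if the list already exceeds 100%
        else:
            pct = int(prob.rstrip("%"))
        used += pct
        pairs.append((int(cycles), pct))
    if used != 100 or any(p < 0 for _, p in pairs):
        raise ValueError("percentages must sum to exactly 100%")
    return pairs

def sample_duration(pairs, rng=random):
    """Draw a number of blocked cycles according to the weights."""
    cycles = [c for c, _ in pairs]
    weights = [p for _, p in pairs]
    return rng.choices(cycles, weights=weights)[0]

# 'block processor 1 30% 2 10% 3 rest' resolves 'rest' to 60%.
pairs = parse_weighted("1 30% 2 10% 3 rest".split())
print(pairs)  # -> [(1, 30), (2, 10), (3, 60)]
```

A fixed number and a range are just special cases: a fixed number '5' is the single pair (5, 100%), and a range '5-10' is six pairs with equal weight.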
Appendix C
Simulator Features
• High level modelling of application in coarse-grained resolution.
• Application and architecture files are human readable text files.
• Output statistics and other data files are stored as easily parsed text files.
• Processors
– Support for multiple hardware contexts (threads)
∗ Each thread can run a specific application
∗ A user-defined delay can be associated with switching threads
∗ Software multithreading and contexts stored in SRAM can be simulated using penalties
∗ Thread switches can occur for all penalties that are declared as only thread-blocking
and not processor-blocking, e.g. memory accesses, coprocessor accesses, semaphores
and more.
∗ Support for different scheduling algorithms
· Fair round-robin and "no scheduler" are provided by default
∗ Manual context switching can also be forced in application code
– Interrupts
∗ Support for an infinite number of interrupt routines
∗ Support for different priorities - a high priority interrupt can be executed when a
low priority interrupt is already executing
∗ Every interrupt routine can either be called automatically on a fixed or random
interval, or be called manually by another running thread
∗ A penalty can be associated with entering and leaving an interrupt routine thus
simulating the overhead associated with interrupts
– Instruction penalties
∗ Some special instructions, such as branches and loops, can have an additional
penalty associated with an action. For example, it is possible for forward
branches to have a different penalty from backward branches, to simulate static
branch prediction.
– Internal coprocessors, memories and ports
∗ An internal coprocessor, memory or port can only be used by the processor it is
attached to and does not need to have a unique name.
– External coprocessors and memories
∗ External coprocessors can be shared among many processors and issues that arise
with this mapping, such as queuing, are modelled.
– Instruction cache
∗ A cache miss percentage and the penalty for an instruction fetch cache miss can be
specified to model instruction cache.
• Multiprocessor approaches
– An unlimited number of processors can be used in a system - each of them running
its own application
– Synchronisation and communication between processors
∗ Synchronisation, communication and more can be modelled using global atomic
semaphores
∗ Real information exchange can be modelled using conditions (these should be used
together with semaphores, creating a mutual exclusion area, to ensure deterministic
behaviour)
– Resources, such as coprocessors, memories and similar can be shared among processors
in a multiprocessor system and requests can be queued
• Ports
– Support for read and write operations with different penalties
• Memories
– Support for read and write operations with different penalties
– Cache memory support
∗ The expected cache miss ratio can be specified
∗ Penalties for a cache hit and a cache miss can be specified
∗ It is possible to simulate ”context switch on cache miss” using penalties.
– Queue support
∗ Memory requests can be queued if the memory is busy
• Coprocessors
– Support for an infinite number of actions, where every action can have a different penalty
– Advanced queue support
∗ Coprocessor requests can be queued if the coprocessor is busy
∗ It is possible to let the coprocessor accept a limited number of requests in parallel
∗ It is possible to define a minimum time between two consecutive requests
∗ Using this, a pipelined coprocessor or a load-sharing cluster can be modelled.
• Advanced instruction controls
– An instruction normally takes one or more cycles, but to simulate some multi-resource
instructions (e.g. an instruction that reads from one memory and writes to another),
several instructions can be started in the same cycle (though they do not necessarily
have to end at the same time)
• Statistics generation
– Coprocessors', memories' and processors' internal states, such as requests, queues, cache
memory hits and similar, can be dumped to file for every executed cycle, thus acquiring
statistics from the run. This can be used for examining performance, utilisation
and more.
– Checkpoints (markers) can be put in the application code; each checkpoint passage is
recorded together with the clock cycle at which it was passed. This can be used for
examining latency, program flow and packet lifetime.
Appendix D
NAT - Flow Chart
[Flowchart: downstream LAN ingress. Steps: fetch a packet pointer from a LAN ingress queue;
extract and check the L2 protocol; look up the outgoing port number by sending the MAC address
to a coprocessor; fetch the old frame CRC from memory; calculate the CRC over the Ethernet
frame in a coprocessor and compare; extract and check the L3 protocol; put the packet pointer on
the appropriate egress queue, or send it to the host via an interrupt.]
Figure D.1: Downstream ingress part
[Flowchart: downstream IP handling. Steps: read and interpret the IP header fields; check
whether the packet is fragmented and, if so, send it to the reassembly coprocessor, which returns
when the packet is complete; extract the IP source and destination, L4 protocol and TCP source
and destination ports, and start the classification coprocessor; packets of other protocols are sent
to the host, which continues handling them.]
Figure D.2: Downstream IP packets
[Flowchart: downstream NAT translation. Steps: build an ip+port token and look up the
destination L4 port and IP address in the CAM coprocessor; add an entry or refresh the mapping
timeout; rewrite the destination IP address and L4 port in memory; recalculate the TCP/UDP
checksum in a coprocessor and write it back to the packet; recalculate the IP header checksum
and write it back; read the packet length and fragment the packet if it exceeds the media MTU;
look up the MAC address via the ARP coprocessor; recalculate the frame CRC and replace the
packet's CRC with the computed one; decide the queue from the classification results and put the
packet pointer on the WAN egress port.]
Figure D.3: Downstream TCP/UDP port translation
[Flowchart: upstream WAN ingress. Steps: fetch a packet pointer from the WAN ingress port;
extract and check the L2 protocol; fetch the old frame CRC from memory; calculate the CRC
over the Ethernet frame in a coprocessor and compare it with the packet's CRC field; extract and
check the L3 protocol; unknown protocols are sent to the host via an interrupt.]
Figure D.4: Upstream ingress
[Flowchart: upstream IP handling. Steps: read and interpret the IP header fields; check for
fragmentation and use the reassembly coprocessor if needed; extract the IP source and destination,
L4 protocol and TCP ports, and start the classification coprocessor; packets of other protocols
are sent to the host.]
Figure D.5: Upstream IP packets
[Flowchart: upstream NAT translation. Steps: build an ip+port key and look it up in the CAM
coprocessor to see whether a mapping entry exists; refresh the mapping timeout; rewrite the
destination IP address and L4 port in memory; recalculate the TCP/UDP checksum in a
coprocessor and write it back; recalculate the IP header checksum and store it in the packet; read
the packet length and fragment the packet if it exceeds the media MTU; look up the MAC
address via the ARP coprocessor; recalculate the frame CRC and replace the packet's CRC with
the computed one; look up the outgoing port from the packet's MAC address and put the packet
pointer on the appropriate egress queue.]
Figure D.6: Upstream TCP/UDP port translation
In English
The publishers will keep this document online on the Internet - or its possible
replacement - for a considerable time from the date of publication barring
exceptional circumstances.
The online availability of the document implies a permanent permission for
anyone to read, to download, to print out single copies for your own use and to
use it unchanged for any non-commercial research and educational purpose.
Subsequent transfers of copyright cannot revoke this permission. All other uses
of the document are conditional on the consent of the copyright owner. The
publisher has taken technical and administrative measures to assure authenticity,
security and accessibility.
According to intellectual property law the author has the right to be
mentioned when his/her work is accessed as described above and to be protected
against infringement.
For additional information about the Linköping University Electronic Press
and its procedures for publication and for assurance of document integrity,
please refer to its WWW home page:
© Victor Boivie