A High Performance Agent Based Modelling Framework
on Graphics Card Hardware with CUDA
Paul Richmond
Dr Simon Coakley
Dr Daniela Romano
University of Sheffield, UK
Department of Computer Science
Regent Court, 211 Portobello
Sheffield, S1 4DP
+44 (0) 114 222 1877
University of Sheffield, UK
Department of Computer Science
Regent Court, 211 Portobello
Sheffield, S1 4DP
+44 (0) 114 222 1900
University of Sheffield, UK
Department of Computer Science
Regent Court, 211 Portobello
Sheffield, S1 4DP
+44 (0) 114 222 1800
We present an efficient implementation of a high performance
parallel framework for Agent Based Modelling (ABM)
exploiting the parallel architecture of the Graphics Processing
Unit (GPU). It provides a mapping between formal agent
specifications, with C based scripting, and optimised NVIDIA
Compute Unified Device Architecture (CUDA) code. The CUDA
specific graphics card hardware is introduced, and existing high
performance implementations of interacting systems are
presented. The mapping of agent data structures and agent
communication is described, and our work is evaluated through a
number of simple interacting agent examples. In contrast with an
alternative, single machine CPU implementation, a speedup of
80 times is reported.
clusters, which make them expensive and unsuitable for the
mass market. This paper describes the implementation of a high
performance parallel framework for ABM which exploits the
parallel architecture of the Graphics Processing Unit (GPU).
The GPU is primarily designed to stream graphics primitives
through a rendering pipeline. Recent interest has however
highlighted performance gains in algorithms that exploit the
hardware for general purpose use (often referred to as General
Purpose computation on the GPU or GPGPU). In the past
GPGPU techniques have focused on utilising graphics libraries
such as OpenGL or DirectX to exploit the architecture [17]. The
lack of direct access to hardware functionality does however
make the process tedious, especially with respect to debugging.
Fortunately a number of alternative approaches are now available
for programming the GPU [2, 12]. These provide an intermediate
extension to the C language specification for general stream
processing. During compilation the stream programming code is
translated to either graphics based or C++ reference code.
Although this technique simplifies the process of programming
the GPU, it is still reliant on graphics based APIs and does not
provide direct access to the GPU's underlying architecture.
The NVIDIA Compute Unified Device Architecture (CUDA) library
[15] is an exception to this problem, allowing direct access to
GPU processors and device memory. Despite CUDA’s intuitive
programming interface, performance gains are often achieved
only through careful optimisation, requiring advanced knowledge
of the hardware’s capabilities and optimal operating conditions.
Agent Based Modelling (ABM) is the simulation of group
behaviour from a number of individually autonomous agents. The
ability to simulate complex systems from simple rules makes
ABM attractive in numerous fields of research including, but not
limited to, systems biology, computer graphics and the social
sciences. Generally ABM tools such as Repast, Mason and
Swarm are primarily aimed at a single CPU architecture, and
whilst they are well developed and offer simple agent
specification techniques, their inherent lack of parallelism
seriously affects the scalability of models. Alternatively high
performance frameworks for ABM [4] have targeted processing
The focus of this paper is the efficient implementation of a
complete ABM framework for the GPU. More specifically it
describes the mapping between formal agent specifications
with C based scripting and optimised CUDA code. This includes
a number of key ABM building blocks such as multiple agent
types, agent communication and birth and death allocation. The
advantage of this is twofold. Firstly Agent Based (AB)
modellers are able to focus on specifying agent behaviour and
run simulations without explicit understanding of CUDA
programming or GPU optimisation strategies. Secondly
Categories and Subject Descriptors
I.2.11 [Computing Methodologies]: Distributed Artificial
Intelligence - Languages and structures, Multiagent systems,
I.3.1 [Computer Graphics]: Hardware Architecture - Graphics
processor, Parallel processing.
Agent Based Modelling, Performance, Parallel Algorithms,
Graphics Processing Unit, CUDA
Cite as: Title, Author(s), Proc. of 8th Int. Conf. on Autonomous Agents
and Multiagent Systems (AAMAS 2009), Decker, Sichman, Sierra, and
Castelfranchi (eds.), May, 10–15, 2009, Budapest, Hungary, pp. XXXXXX. Copyright © 2009, International Foundation for Autonomous Agents
and Multiagent Systems (www.ifaamas.org). All rights reserved.
simulation performance is significantly increased in comparison
with non GPU alternatives. This allows simulation of larger
model sizes and offers high performance modelling at a fraction
of the cost of high performance grid based alternatives.
The paper first introduces CUDA specific graphics card
hardware and existing high performance
implementations of interacting systems. The mapping of agent
data structures and agent communication is then described.
Finally the framework is evaluated through a number of simple
interacting agent examples which demonstrate key performance
Whilst this paper describes, to the best of our knowledge, the first
fully functioning High Performance ABM framework for the
GPU, some existing work has been inspirational. The work
described in this section varies with respect to application and
implementation platform. The common underlying theme
throughout is centred on ABM or high performance interacting
The class of GPU hardware targeted by the work in this paper is
specifically limited to CUDA enabled graphics cards. Whilst it is
desirable to support a wider range of GPUs, the CUDA API
allows access to hardware functionality not supported by older
generation cards and competing GPU manufacturers. More
specifically the availability of local (on chip) shared memory
offers extremely fast parallel memory access operations for
threads within the same multiprocessor. In addition to this local
synchronisation provides thread cooperation allowing data
caching through shared memory access.
Cellular Automata (CA) [21] are a simplistic example of an
interacting system. As a subset of ABM, CA are more confined
due to the discrete space environment and limited finite number
of states. Early high performance CA examples [7] have
previously utilised GPU performance, however their adaptability
towards more advanced ABM is limited. More interesting are
systems which alleviate the discrete environment and state
limitation. The Coupled Map Lattice (CML) [9] makes a
noticeable improvement over CA by providing continuous lattice
values. Better still are GPU particle system implementations,
which focus on the modelling of certain classes of fuzzy
phenomena. Similarly to ABM, particle systems are
continuously valued systems with interacting individuals.
Technically many implementations of interacting particle systems
are based around the particle mesh method [3, 13]. This involves
converting particles into discrete space density values which are
then used to approximate interactions such as gravitational
potential. Similarly to this is work by D’Souza et al. [5] which
utilises discrete partitioning, with the difference that agents are
directly scattered into discrete space rather than cumulative
density values. This work has many similar interests to our own
and demonstrates a number of high performance agent based
models on the GPU. Unlike our own work the discrete
partitioning nature of the algorithms is memory intensive and
limits the agent environments to fine grained 2D or coarse
grained 3D. As the discrete partitions are increased in size the
memory requirement is reduced however the likelihood of
collisions (multiple agents scattered within the same partition)
increases. This is addressed by D’Souza et al. [5] through the
implementation of a multi pass priority scheme. Whilst
extremely efficient overall, the reliability of the priority
scheme is questionable, as is the convergent random iterative
scheme used for birth allocation. Additionally little consideration
is given towards agent specification or more general agent
systems such as those that exist within spatially distributed,
continuous 3D environments.
The GPU programming model is described in detail in the
CUDA Programming Guide [14] where it is presented as a
parallel coprocessor. The GPU device architecture is described as
Single Program Multiple Data (SPMD) where the program, or
kernel, is some function native to the device, which then operates
on multiple threads each inputting and outputting different
(usually linearly offset) data. In order to generalize CUDA to
multiple hardware implementations (with varying parallel
capabilities) the idea of a grid of thread blocks is used to group
cooperating threads. The thread blocks which must all share the
same dimensionality and kernel instruction are then
When considering pure performance, implementations that are
aimed only at limited range interactions have by far the best
performance. This performance is gained through avoidance of
the O(n²) complexity imposed on systems with brute force total
communication. Recent work by Richmond & Romano [19]
follows the same technique as the one used for collision detection
between advanced interacting particle systems [8] to implement a
framework for swarm systems on the GPU. It is suitable for a
number of hardware platforms and has been demonstrated
previously on the PS3 [18] (which differs slightly in that an N-Nearest
neighbour scheme is used). Discrete spatial partitioning
The speed of GPU hardware is attributed to the architectural
design. Unlike more generic and flexible CPUs, the GPU's
architecture is task specific, making it highly optimised for stream
programming applications. Technically the GPU not only exceeds
the transistor count of modern CPUs, but a significantly higher
proportion of transistors is available for data processing, rather
than data caching and flow control [14]. In addition to this the
GPU's memory bandwidth exceeds system memory
bandwidth by a factor of 10. Figure 1 demonstrates the
computational power of the GPU in direct comparison with Intel
Figure 1 - Performance of GPU (Green) vs. CPU (Blue).
Figure courtesy of NVIDIA [14].
is used, however agents or particles are sorted into an
ordered list. This avoids the memory cost associated with storing
individuals directly within partitions and allows any number of
agents per partition. Partition boundaries (maximum and
minimums from the sorted list) are then scattered into a matrix to
allow agents to directly access all agents within neighbouring
spatial bins. In the case of a swarm example presented by
Richmond & Romano [19] up to 65k agents can be simulated and
rendered at roughly 30fps with exact performance depending on
the communication radius between agents. Whilst the
performance is below that of pure collision detection within a
CUDA particles system [8], Richmond & Romano’s [19] work
demonstrates a useable C++ framework allowing single agent
specification and generalised agent scripting. The work described
in this paper improves upon this not in performance but in
flexibility. Multiple agent types, environment interactions and
birth and death allocation are essential for more generalized
ABM beyond that of simple swarms.
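The sort based partitioning described above can be illustrated on the CPU. The following is a minimal C sketch under stated assumptions, not the implementation from [19]: the names Agent and build_bins are hypothetical, the environment is reduced to one dimension, and the standard library qsort stands in for a GPU sort.

```c
#include <stdlib.h>

/* Hypothetical agent record: a position and the spatial cell it falls in. */
typedef struct {
    float x;
    int   cell;
} Agent;

static int compare_cell(const void *a, const void *b)
{
    const Agent *pa = (const Agent *)a;
    const Agent *pb = (const Agent *)b;
    return (pa->cell > pb->cell) - (pa->cell < pb->cell);
}

/* Sort agents by cell, then record for every cell the half-open range
 * [start, end) it occupies in the sorted list. Empty cells get start == end. */
void build_bins(Agent *agents, int n, float cell_size, int num_cells,
                int *start, int *end)
{
    for (int i = 0; i < n; ++i) {
        int c = (int)(agents[i].x / cell_size);
        if (c < 0) c = 0;                          /* clamp to the environment */
        if (c >= num_cells) c = num_cells - 1;
        agents[i].cell = c;
    }
    qsort(agents, (size_t)n, sizeof(Agent), compare_cell);

    for (int c = 0; c < num_cells; ++c) {
        start[c] = 0;
        end[c] = 0;
    }
    for (int i = 0; i < n; ++i) {
        int c = agents[i].cell;
        if (i == 0 || agents[i - 1].cell != c) start[c] = i;      /* partition minimum */
        if (i == n - 1 || agents[i + 1].cell != c) end[c] = i + 1; /* partition maximum */
    }
}
```

An agent in cell c can then visit its neighbours by walking the sorted list over the ranges of cells c-1, c and c+1, without storing agents inside the partitions themselves.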
Whereas the previous techniques deal with discrete space and
limited communication radii, the communication mechanism
presented in this paper is achieved through a brute force O(n²)
technique. The decision to use such a technique is influenced by a
number of factors. Firstly the aim of this paper is to provide a
flexible ABM framework suitable for as wide a class of models as
possible. Brute force all pairs communication not only allows any
range of interaction to be evaluated, but recent work
demonstrating brute force N-Forces modelling [16] also suggests
that O(n²) algorithms can be balanced almost optimally on
GPU hardware. This allows Giga-Floating Point Operations
(GFLOP) performance close to the GPU's theoretical limits.
Secondly the FLexible Agent Modelling Environment (FLAME)
architecture [4], which this paper extends (and which is described in
detail in the following section), utilises the same O(n²)
communication pattern. This permits a direct comparison of the
performance of our GPU specific implementation and FLAME's
original, single CPU alternative.
Formal agent based specification is important within agent based
modelling as it allows a simple and intuitive way of defining
agents and their associated behaviour. The choice to extend the
FLAME framework and its open specification format not only
aids better collaboration and understanding, but provides a basis
for formal validation and verification of code [4]. Whilst not a
modelling platform itself, FLAME's formal specification language
(XMML) is based around a formal modelling concept called the
X-Machine [6]. X-Machines have previously been used for
formal verification of swarms [20] and are themselves an
extension of Finite State Machines (FSMs). They differ with
respect to their inclusion of internal memory, which may be
modified during the transition of internal states. FLAME builds
upon a smaller class of X-Machines known as Communicating
Stream X-Machines (CSXMs) [1] that, due to their streaming
data design, are well suited for integration within parallel
systems. Stream X-Machines form the main foundation for agent
specification. Agents are defined as a set of states and internal
memory, with a transition function determining the next agent
state and performing internal memory updates (Figure 2). The
extension of communication simply provides a mechanism for
X-Machines to exchange messages through a communication
matrix. In FLAME this is replaced by more flexible
variable length Message Lists.
Figure 2 – Stream X-Machine Specification, M and M’
represent agent memory before and after agent function F1
which inputs and outputs messages to the message list.
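The X-Machine idea of states plus internal memory can be sketched in a few lines of C. This is an illustrative toy rather than FLAME's generated code: the type XMachine, the two states, the energy variable and the threshold of 1.0 are all assumptions chosen for the example.

```c
/* Hypothetical two-state agent, used only to illustrate the X-Machine idea. */
typedef enum { STATE_IDLE, STATE_ACTIVE } State;

typedef struct {
    State state;   /* current state, as in a finite state machine */
    float energy;  /* internal memory M, absent from a plain FSM   */
} XMachine;

/* Transition function F: consumes one input message, updates memory
 * (M -> M'), selects the next state and emits one output message. */
float transition(XMachine *xm, float input_message)
{
    xm->energy += input_message;
    xm->state = (xm->energy > 1.0f) ? STATE_ACTIVE : STATE_IDLE;
    return xm->energy;
}
```

The next state depends on the updated memory as well as on the input, which is what separates an X-Machine from a finite state machine.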
In addition to XMML, FLAME also provides a template based
system for code creation. In theory the FLAME specification and
template systems should be suitable for simply providing a GPU
simulation backend, but in practice there are a number of subtle,
yet key changes, which have been integrated. These changes are
mostly the result of the finer grained parallelism offered by the
GPU over FLAME's high performance PC grid implementation.
Rather than parallel nodes containing a variable number of
agents communicating messages through MPI, the lightweight
parallel threads of the GPU are more directly suited to an
individual agent level. This implies that any global functions or
variables previously used to iterate messages and agents on a per
node basis have been removed in favour of parameterized agent
The XMML specification has been changed to
include the formal specification of agent function input and
output messages. Likewise an additional parameter
<bufferSize> is required for each agent and message type
within the system. This acts as a maximum size of either the
agent or message population and is required as a result of pre
allocating GPU memory before the simulation stage. Although
this places a stringent limitation on simulation size, the removal
of dynamic memory allocations during the simulation is essential
in gaining maximum GPU performance. End-users are warned at
any point during simulation if a bufferSize is exceeded and
it is recommended that sufficiently large buffers are used
(obviously within the bounds of GPU memory) as unused buffer
space has no negative effect on simulation performance.
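The effect of a fixed bufferSize can be sketched with a pre-allocated list that refuses to grow. The following C fragment is a host-side illustration only; the FixedList type and its functions are hypothetical names, and the warning text is an assumption rather than the framework's actual message.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical fixed-capacity list; buffer_size plays the role of the
 * <bufferSize> value and is fixed before the simulation starts. */
typedef struct {
    float *values;
    int    count;
    int    buffer_size;
} FixedList;

/* Allocate once, up front. Returns 0 on success, -1 on failure. */
int fixed_list_init(FixedList *list, int buffer_size)
{
    list->values = (float *)malloc((size_t)buffer_size * sizeof(float));
    list->count = 0;
    list->buffer_size = buffer_size;
    return list->values ? 0 : -1;
}

/* Append without reallocating. Warns and returns -1 if the buffer is full. */
int fixed_list_add(FixedList *list, float value)
{
    if (list->count >= list->buffer_size) {
        fprintf(stderr, "warning: bufferSize %d exceeded\n", list->buffer_size);
        return -1;
    }
    list->values[list->count++] = value;
    return 0;
}

void fixed_list_free(FixedList *list)
{
    free(list->values);
    list->values = NULL;
    list->count = 0;
    list->buffer_size = 0;
}
```

No allocation happens after fixed_list_init, which mirrors the removal of dynamic memory allocation from the simulation loop.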
Theoretically each agent function is represented by a unique
GPU kernel. This provides a logical function mapping which
ensures global synchronization of the entire agent population
after each transitional stage. In practice however, our framework
wraps user specified agent functions with a special GPU kernel
which hides the details of efficient access to the GPU's global
memory. Individual agent data per parallel thread is then passed
to the agent function after it is stored temporarily in the much
faster multiprocessor register space. For simplicity this
individual agent data is stored using a single C structure. This
allows the agent function to get and set agent variables directly
and prevents unsafe direct access to other agents within the
population. Although it would be intuitive to therefore store the
agent population data with an Array of Structures (AoS), this has
serious memory access performance implications. Instead, agent
population data is stored as a single Structure of Arrays (SoA)
(Figure 3). This allows a more efficient memory access pattern
for both reading and writing data in global GPU memory. The
reason for this is GPU memory coalescing, which allows data
accessed by consecutive threads to issue fewer wide memory
requests, making more efficient use of the memory bus [11].
Coalescing requires that consecutive threads access data
variables in the same linear, consecutive order. The exact
performance advantage of this
technique is evaluated later in Section 8.
// Array of Structures (AoS)
typedef struct agent{
  float mem_val_1;
  float mem_val_2;
} xm_memory_agent_list [N];

// Structure of Arrays (SoA)
typedef struct agent_list{
  float mem_val_1[N];
  float mem_val_2[N];
} xm_memory_agent_list;
Figure 3 - AoS vs. SoA data storage.
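The difference between the two layouts shows up in the distance between the addresses that neighbouring threads touch. The C sketch below computes byte offsets for the mem_val_1 variable in each layout; the type names AgentAoS and AgentListSoA and the population size N = 1024 are assumptions made for the example.

```c
#include <stddef.h>

#define N 1024  /* assumed population size */

typedef struct { float mem_val_1; float mem_val_2; } AgentAoS;            /* one agent  */
typedef struct { float mem_val_1[N]; float mem_val_2[N]; } AgentListSoA; /* all agents */

/* Byte offset of mem_val_1 for agent i in an AoS list. */
size_t aos_offset(size_t i)
{
    return i * sizeof(AgentAoS) + offsetof(AgentAoS, mem_val_1);
}

/* Byte offset of mem_val_1 for agent i in the SoA list. */
size_t soa_offset(size_t i)
{
    return offsetof(AgentListSoA, mem_val_1) + i * sizeof(float);
}
```

With SoA, threads i and i+1 read addresses exactly one float apart, which is the contiguous pattern that coalescing needs. With AoS the stride is the size of the whole structure, so the stride grows with every variable added to the agent.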
A single SoA for agents is sufficient for agent functions which
update only their internal state. Additional data storage is
however required to provide agent birth and death functionality.
More specifically for the case of agent births, any agent may, or
may not produce a new agent. Therefore the entire agent
population must be double buffered to provide sufficient storage
space for potential new agents.
The XML tag
<agentOutput> has been introduced to the XMML
xmachine_memory_agentname_list SoA pointer should
be passed as an argument to any agent function requiring the
allocation of new agents of that type. Agent functions can then
make use of a specific add_Agentname_agent function,
which requires the SoA pointer as an argument. Rather than
representing the current agent memory list, the SoA pointer
argument represents the double buffered memory space
(demonstrated in Figure 4). The equal size of the agent data SoA
and new agent data SoA allows a linear output of new agent data.
The new agent data is output to the same position in the double
buffered new agent list as the existing parent agent. As it is
likely that not every agent will give birth to a new agent during
the agent function, it is important that the potentially large set of
new agents is compacted before they are appended to the existing
agent population. In order to achieve this, newly created agents
use a _flag variable within the SoA agent memory list to
indicate the presence of new agent data. Following the agent
function, a linear time, O(n), inclusive parallel prefix sum
algorithm [10] is used to write the sum value to an additional
_position variable within the SoA agent memory list. This
allows a final (post agent function) scatter kernel to run over the
new agent memory list, appending flagged data to the end of the
original agent memory. The updated agent list is then used for
the next agent function or simulation iteration.
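The flag, prefix sum and scatter steps can be followed in a serial C sketch. It is a simplified illustration under stated assumptions: agents carry a single float, the function names inclusive_scan and scatter_append are hypothetical, and the loop in inclusive_scan stands in for the parallel scan of [10].

```c
/* Inclusive prefix sum over the flags: position[i] = flag[0] + ... + flag[i].
 * Written serially here; on the GPU this step is a parallel scan. */
void inclusive_scan(const int *flag, int *position, int n)
{
    int sum = 0;
    for (int i = 0; i < n; ++i) {
        sum += flag[i];
        position[i] = sum;
    }
}

/* Append every flagged new agent to the end of the agent list.
 * Returns the updated agent count. The caller must ensure that
 * agents[] has room for agent_count + position[n - 1] entries. */
int scatter_append(const float *new_agents, const int *flag,
                   const int *position, int n,
                   float *agents, int agent_count)
{
    for (int i = 0; i < n; ++i) {
        if (flag[i]) {
            /* inclusive sum minus one gives a zero-based output slot */
            agents[agent_count + position[i] - 1] = new_agents[i];
        }
    }
    return agent_count + (n > 0 ? position[n - 1] : 0);
}
```

Each flagged entry writes to a distinct slot, so the scatter loop has no dependencies between iterations and maps onto one thread per entry.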
Similarly to agent births, agent deaths require additional
buffered data storage which is referred to as the swap list. This
swap list is of the same dimension and type as the original agent
data SoA and acts as output during the post agent function
compacting process. As with agent births, agent deaths
require a flag to indicate data to be removed from the agent data
list. For this the _flag variable in the original agent memory
list is set in the agent function wrapper kernel. The value of the
flag depends on the return value of the agent function (which always
returns a single integer value). A return value of 1 indicates an
agent death. The same parallel prefix sum algorithm and agent
scatter functions are then applied to the original agent data list
with the compacted agent data placed into the swap list. Deaths
are handled before birth allocation with pointers to the original
agent data and swap being exchanged following dead agent
removal process. This results in a compacted agent list before
any new agents are appended.
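The role of the swap list and the pointer exchange can be sketched as follows. This is a serial C illustration, assuming a single float per agent and a flag value of 1 for agents to be removed; AgentBuffers and remove_dead are hypothetical names, and the running counter takes the place of the parallel prefix sum.

```c
/* Hypothetical agent list with a single memory variable. */
typedef struct {
    float *data;   /* current agent list */
    float *swap;   /* swap list of the same size */
    int    count;
} AgentBuffers;

/* Copy surviving agents (dead[i] == 0) into the swap list, preserving order,
 * then exchange the two pointers so that the compacted list becomes current.
 * The running counter stands in for the parallel prefix sum. */
void remove_dead(AgentBuffers *b, const int *dead)
{
    int alive = 0;
    for (int i = 0; i < b->count; ++i) {
        if (!dead[i]) {
            b->swap[alive++] = b->data[i];
        }
    }
    float *tmp = b->data;
    b->data = b->swap;
    b->swap = tmp;
    b->count = alive;
}
```

Exchanging pointers avoids copying the compacted data back, and the old list becomes the swap space for the next compaction.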
As both birth allocation and death are potentially costly with
respect to performance, the XMML specification guards against
any unnecessary computation by ensuring that births are only
evaluated under the condition of the <agentOutput> tag.
Similarly a Boolean <reallocate> tag (with a default value
of true) is available to avoid compaction of the original agent
data list where it is known in advance that the agent function will
never result in an agent death. An example of a GPU wrapper
kernel is demonstrated in Figure 4. This shows the coalesced
data reading and writing as well as the setting of the death flag
for agents. The integer value index refers to the agent position
in the agent population and is determined through the current
thread and thread block positions on the GPU multiprocessor.
__global__ void GPUFLAME_agentFunction(
    xmachine_memory_Agentname_list* agents,
    xmachine_memory_Agentname_list* agent_births){
  //global thread index = agent position in the population
  int index = __mul24(blockIdx.x,blockDim.x) + threadIdx.x;
  //SoA to AoS - Coalesced memory read
  xmachine_memory_Agentname agent;
  agent.mem_val_1 = agents->mem_val_1[index];
  agent.mem_val_2 = agents->mem_val_2[index];
  //agent function call
  int dead = !agentFunction(&agent, agent_births);
  //reallocation flag
  agents->_flag[index] = dead;
  //AoS to SoA - Coalesced memory write
  agents->mem_val_1[index] = agent.mem_val_1;
  agents->mem_val_2[index] = agent.mem_val_2;
}
Figure 4 - An agent function wrapper kernel, demonstrating
coalesced memory access, setting of the agent death flag and
passing of the new agent data SoA.
In distributed agent systems a message passing interface (MPI) is
essential for communication between nodes. For our GPU system
however, all agent data is contained on the GPU making direct
agent access an option. Despite this our work implements
message lists for the following reasons. Firstly a non message
based system would require a further buffer of agent memory to
represent streamed input and output. This would be mandatory to
ensure that previous simulation step data would remain constant
across the population during an agent function. Alternatively, a
message based system more efficiently uses memory resources.
Only the data required for communication needs to be duplicated
rather than the potentially much larger, full set of agent memory.
Secondly the use of messages lends itself well to future multi
GPU implementations. In this case messages may simply be
passed between multiple GPU equipped hosts using MPI.
Finally message outputs are of two distinct types, a single
message or optional message. In the case of an optional message
type, lists may be significantly smaller than the agent population.
This makes message list iteration more efficient than direct
access to every agent.
Reading messages is performed in a similar way to accessing
agent variables. A pointer to a C structure, stored in the
xmachine_message_messageName format, is returned
from a message retrieval function. As with agent data, reading
and writing message data from global memory is performed
xmachine_message_messageName_list SoA variable
is used for storing the message list and is required as an
argument to the get_first_message() and get_next_message()
functions. The XMML specification has been adapted to include
<input> and <output> tags and an additional <type> tag
indicates the message type of either single_message or
optional_message for message outputs. When outputting
messages, both message types follow the same linear output
process of writing to the message list in the same location as the
position of the agent in the agent list. Where some previous agent
function has already written messages to the list this location is
shifted by the message list size.
In the case of optional
messages a message list buffer is used and the previously
described compaction technique is applied before new messages
are appended to the original list. Figure 5 demonstrates an
example agent function reading an input message1 and writing
a single message output message2. Below this is the
corresponding XMML specification of the message function.
//Input : message1, Output: message2, Agent Output: none
__FLAME_GPU_FUNC__ int exampleFunction(
    xmachine_memory_SimpleAgent* xmemory,
    xmachine_message_message1_list* message1_messages,
    xmachine_message_message2_list* message2_messages)
{
  /* get the first message1 type message */
  xmachine_message_message1* message1_message =
      get_first_message1_message(message1_messages);
  while(message1_message)
  {
    xmemory->var1 += message1_message->message1_var1;
    /* get the next message1 type message */
    message1_message = get_next_message1_message(
        message1_message, message1_messages);
  }
  /* output a message2 type message */
  float message2_var1 = xmemory->var1;
  float message2_var2 = xmemory->var2;
  add_message2_message(message2_messages,
      message2_var1, message2_var2);
  return 0;
}
Figure 5 – An example agent function with corresponding
XMML function specification.
Individual message access within agent functions is handled
by the get_first_messageName_message and
get_next_messageName_message functions (also demonstrated in
Figure 5). These functions implement the brute force message
loading which, inspired by Nyland et al. [16], utilizes shared
memory by serialising message access across threads.
Technically this requires that messages are split into groups with
the first message group being loaded into shared memory by the
get_first_message() function (Figure 6). Following this
each thread within the same thread block sequentially requests
new messages using the get_next_message() function.
After each thread has exhausted the messages within the group
the get_next_message() function synchronises threads in
the block and loads the next group of messages into shared
memory (Figure 7).
As both the message group size and thread block size are equal,
individual threads are responsible for loading shared memory
values concurrently. A thread synchronisation is performed after
loading any data into shared memory and ensures that all
messages are available to all threads within the block. To avoid
all thread blocks reading the same groups, the first group load of
any block (issued by the get_first_message() function)
starts by loading data into shared memory at offset locations in
global memory. Thread blocks beginning mid way through the
message list load each message group sequentially from their
starting group before circulating back to the first. The
get_next_message() function then returns false after the
same number of messages across the entire agent population
have been processed.
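The group based loading scheme can be emulated serially to check that every thread block still sees every message. The C sketch below is an illustration under stated assumptions: the starting group of a block is taken to be the block index modulo the number of groups, which is one possible choice of offset and not necessarily the one used by the framework, and the names group_for_step and block_sum_messages are hypothetical.

```c
/* Index of the message group that thread block `block` loads on its
 * `step`-th load (step = 0 is the load issued by get_first_message()).
 * Blocks start at different groups and wrap around to the first. */
int group_for_step(int block, int step, int num_groups)
{
    return (block + step) % num_groups;
}

/* Serial emulation of one thread block summing every message value.
 * `shared` stands in for on-chip shared memory and holds one group. */
float block_sum_messages(const float *messages, int message_count,
                         int group_size, int block, float *shared)
{
    int num_groups = (message_count + group_size - 1) / group_size;
    float sum = 0.0f;
    for (int step = 0; step < num_groups; ++step) {
        int first = group_for_step(block, step, num_groups) * group_size;
        /* each "thread" t loads one message of the group; slots past the
         * end of the list are padded with zero */
        for (int t = 0; t < group_size; ++t) {
            int m = first + t;
            shared[t] = (m < message_count) ? messages[m] : 0.0f;
        }
        /* a __syncthreads() barrier would sit here on the GPU */
        for (int t = 0; t < group_size; ++t) {
            sum += shared[t];
        }
    }
    return sum;
}
```

Because each block visits all groups exactly once, only the order of messages differs between blocks, while concurrent blocks read different regions of global memory at any one step.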
the message count and avoids any unnecessary access to DRAM
memory within the get_next_message() function. Figure
8 also makes reference to the resetting of agent and message
swap buffers. This is simply an additional kernel which runs over
the agent or message list setting the _flag variable to 0. It is
not necessary to reset the entire message or agent swap data as
this will never be appended to the original lists until both the
data and _flag values are overwritten.
Figure 6 – Message group loading, when requesting the first
and next message.
Figure 7 - Message group loading, when requesting the next
message from the beginning of a new message group.
FOR each message output
  check for possible out of bounds
  set message output type in SoA list
  reset message swap buffer
FOR each message input
  set shared memory size to largest message + 1 int
FOR each agent output
  reset new agent output buffer
IF reallocate is true
  reset current function agent swap buffer
call agent function kernel wrapper
FOR each message output
  perform prefix sum scan on message swap buffer
  scatter append optional messages from message swap to
  message list
  update internal message counter
As agent and hence message list sizes are liable to change
throughout the simulation process, it is important to consider
thread path divergence to avoid any deadlock problems. Unused
threads are likely and are a result of the total number of agents
not being a multiple of the thread block size. Rather than leave
these threads idle it is essential that they follow the same path as
occupied threads within the block. Whilst this results in agent
data beyond the last agent in the list being processed with the
agent function, the path these threads follow ensures that full
message groups are loaded into shared memory. The alternative
to this would be that idle threads refrain from updating agent
data, load messages and wait for synchronisation. Unfortunately
the __syncthreads() barrier is dependent on its location in the
code, so it is a requirement that these threads follow the same
conditional branch paths. To ensure that blank agent data exists
beyond the last agent in the list, the FLAME X-Parser has been modified
to check that the <bufferSize> value for agents is a multiple of
the thread block size. As message loading is sensitive
to any divergence between thread paths, breaking from within
the message loop is also expressly forbidden and may result in
unexpected behaviour.
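The multiple of block size rule comes down to integer rounding. The helper functions below are a minimal C sketch with hypothetical names; they are not part of the X-Parser.

```c
/* Smallest multiple of block_size that is >= n. */
int round_up_to_block(int n, int block_size)
{
    return ((n + block_size - 1) / block_size) * block_size;
}

/* Number of thread blocks needed to cover n agents. */
int grid_size(int n, int block_size)
{
    return (n + block_size - 1) / block_size;
}

/* 1 if a <bufferSize> value satisfies the multiple-of-block-size rule. */
int buffer_size_is_valid(int buffer_size, int block_size)
{
    return buffer_size > 0 && buffer_size % block_size == 0;
}
```

For example, 1000 agents with a block size of 64 need 16 blocks and a buffer of at least 1024 entries, leaving 24 threads in the last block to process blank agent data.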
Figure 8 demonstrates the CPU host pseudocode for a complete
agent function, including the processing steps required for any
function inputs or outputs. Setting of the shared memory size is
required when multiple message inputs are utilised per agent
function. This is accomplished by allocating enough shared
memory to hold data for the largest message size, plus an
additional integer value. The additional integer is required to hold
IF reallocate is true
  perform prefix sum scan on agent list
  scatter alive agents to agent swap buffer
  exchange swap buffer pointer with agent list pointer
  update internal agent counter
FOR each agent output
  perform prefix sum scan on new agent buffer
  scatter append new agents from new agent buffer to
  agent list
  update internal agent counter
Figure 8 - CPU host pseudocode for a complete agent
function.
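The shared memory sizing step of the host pseudocode can be sketched in C. The text does not spell out whether the size is per thread or per block, so this sketch assumes one slot of the largest message size per thread in the block plus a single integer per block; the function name shared_memory_bytes is hypothetical.

```c
#include <stddef.h>

/* Bytes of shared memory for an agent function with several message inputs.
 * Assumption: one slot of the largest message size per thread in the block,
 * plus one integer per block to hold the message count. */
size_t shared_memory_bytes(const size_t *message_sizes, int num_inputs,
                           int block_size)
{
    size_t largest = 0;
    for (int i = 0; i < num_inputs; ++i) {
        if (message_sizes[i] > largest) {
            largest = message_sizes[i];
        }
    }
    return (size_t)block_size * largest + sizeof(int);
}
```

Sizing for the largest message lets the same shared memory region be reused for each input message type in turn.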
The following results are based on the FLAME Circles
benchmarking model which consists of a single Circle agent
type and single location message. Three agent functions are
used to output messages, input messages and move the agent. All
results have been obtained on a single PC with an AMD Athlon
2.51 GHz Dual Core Processor with 3GB of RAM and a GeForce
9800 GX2. Whilst the GX2 card consists of two independent
GPU cores only a single core has been used for CUDA
processing with the other handling the active display. This
technique allows the circumvention of the Windows watchdog
timer which halts GPU kernels exceeding five seconds in
execution time. In future the X-Parser will be modified to output
an alternative Linux build script which will avoid this problem
with single GPU PCs.
Table 1 demonstrates the relative speedup of our work in direct
comparison with the Circles model running in FLAME, on a
single PC. The times are in milliseconds and represent the
processing time of a single iteration, excluding timing for file IO
or initial GPU data transfer. Whilst there is some initial
fluctuation up to population sizes of 8192 agents, the overall
speedup converges towards roughly 90%. Figure 9 helps us to
understand where our implementation has gained significant
performance. It represents the relative performance of three
implementation examples varying with respect to the
optimizations described in this paper. The lowest performance
example in this figure (GPU AoS) represents the simulation time
of a single iteration of the Circles model without coalesced
memory access, or use of shared memory message reading. The
next example (GPU SoA -SM) represents the performance of the
same model utilizing coalesced memory reading through the
agent function kernel wrappers. As with the first example,
messages are read directly from global memory without utilising
shared memory. Although this gives an average speedup over the
first example by a factor of 10, the final example (GPU SoA +SM)
highlights the significant impact of utilizing shared memory for
message processing.
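The difference between the AoS and SoA examples comes down to how agent variables are laid out in memory. The C sketch below illustrates the two layouts under assumptions: the variable names (x, y, fx, fy), the MAX_AGENTS capacity and the helper functions are hypothetical and are not taken from the Circles model definition.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_AGENTS 1024  /* hypothetical capacity, for illustration only */

/* Array of Structures (AoS): one struct per agent. */
typedef struct {
    float x, y, fx, fy;
} CircleAoS;

/* Structure of Arrays (SoA): one contiguous array per agent variable. */
typedef struct {
    float x[MAX_AGENTS];
    float y[MAX_AGENTS];
    float fx[MAX_AGENTS];
    float fy[MAX_AGENTS];
} CircleSoA;

/* Byte distance between the x value of agent i and that of agent i + 1. */
size_t aos_stride_x(void) { return sizeof(CircleAoS); }
size_t soa_stride_x(void) { return sizeof(float); }

/* Repack an AoS agent list into the SoA layout. */
void aos_to_soa(const CircleAoS *in, size_t n, CircleSoA *out)
{
    assert(n <= MAX_AGENTS);
    for (size_t i = 0; i < n; ++i) {
        out->x[i]  = in[i].x;
        out->y[i]  = in[i].y;
        out->fx[i] = in[i].fx;
        out->fy[i] = in[i].fy;
    }
}
```

With the SoA layout, agent i and agent i + 1 hold the same variable in adjacent words, so when one thread is assigned per agent the threads read consecutive addresses. This is the access pattern that the hardware can coalesce. With the AoS layout the same reads are separated by the size of the whole struct.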
Table 1
[Figure 9 plot: Relative Speedup against Population Size]
Figure 9 - Relative Speedup of 3 experimental tests
demonstrating SoA and shared memory performance.
Table 2 considers further performance optimization by
demonstrating the effect of varying CUDA thread block sizes on
the Circles model. Although a naive estimate would assume
that the largest thread block size gives the best performance
due to the increased thread cooperation, experimentally this is
not true. This is due to the divergent paths of threads within the
same block, a result of agents performing additional
computations on messages depending on radial proximity. As
each batched message group load to shared memory requires
thread synchronization, it is far more likely that larger block
sizes will have a greater divergence across the block. With
respect to synchronisation, more threads in the block results in a
larger number of idle waiting threads. This is highlighted in
Table 2 by the difference between the optimal block sizes of
higher and lower population simulations. In higher population
sizes there is a much greater likelihood of divergence between
threads, as more messages within the limited environment size
will be processed per agent. As a result the higher population
models perform optimally with a small block size of 64. Lower
population simulations demonstrate a much more varied optimal
block size as the likelihood of thread divergence is reduced.
Table 2
In order to evaluate the birth and death functionality of our
framework, Figure 10 represents the results of two extensions to
the Circles model. These extensions include death reallocation
(Blue) and agent birth allocation including death reallocation
(Green) respectively. The Y-axis values in Figure 10 are
expressed as a performance percentage in comparison to the
standard Circles model presented previously in Table 1. Whilst
there is some indication of a performance penalty for agent birth
and death functionality in lower population simulations, this is
less evident for higher populations. This is simply explained by
considering the percentage of time spent processing messages. In
higher populations this O(n²) operation quickly becomes the
performance bottleneck, whereas within smaller population sizes
the effect of birth and death post processing functionality is
more easily observed.
As with agent birth and death, the effect of agent function
optional output messages has been evaluated. In this case the
Circles model functionality was updated to allow agents to
output a message depending on some random value assigned to
them. In all population sizes the effect of post processing on the
message list is vastly diminished by the significant gain during
the input messages stage caused by the message list size
reduction.
[Figure 10 plot: % of Standard Performance against Population
Size; legend entries include Reallocate with Births]
Figure 10 - Performance of agent reallocation as a
percentage of non allocating performance.
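The batched message loads discussed with Table 2 can be illustrated by a sequential C emulation of the tiling pattern for a single agent. This is a sketch under assumptions, not the framework's kernel code: the LocationMessage fields, the MAX_BLOCK bound and the function name count_neighbours_tiled are hypothetical, and the local tile array only stands in for GPU shared memory.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_BLOCK 512  /* hypothetical upper bound on the block size */

typedef struct {
    float x, y;
} LocationMessage;

/* Count the messages within `radius` of an agent at (ax, ay), reading the
   message list one tile of `block_size` messages at a time. If `tile_loads`
   is not NULL it receives the number of tile loads performed; on the GPU
   each tile load would be followed by a block-wide synchronisation. */
size_t count_neighbours_tiled(float ax, float ay, float radius,
                              const LocationMessage *messages, size_t n,
                              size_t block_size, size_t *tile_loads)
{
    assert(block_size > 0 && block_size <= MAX_BLOCK);

    LocationMessage tile[MAX_BLOCK];  /* stands in for shared memory */
    size_t hits = 0;
    size_t loads = 0;

    for (size_t base = 0; base < n; base += block_size) {
        size_t len = (n - base < block_size) ? (n - base) : block_size;

        /* On the GPU each thread in the block would copy one message. */
        for (size_t t = 0; t < len; ++t) {
            tile[t] = messages[base + t];
        }
        ++loads;  /* a barrier would be placed here on the GPU */

        /* Every agent then iterates over the whole tile. */
        for (size_t t = 0; t < len; ++t) {
            float dx = tile[t].x - ax;
            float dy = tile[t].y - ay;
            if (dx * dx + dy * dy <= radius * radius) {
                ++hits;  /* extra work only for nearby messages: a divergent branch */
            }
        }
    }

    if (tile_loads != NULL) {
        *tile_loads = loads;
    }
    return hits;
}
```

A larger block size reduces the number of tile loads, but every thread in the block must reach each synchronisation point before any can continue, so threads that take the proximity branch less often sit idle for longer. This matches the trade-off described for Table 2.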
In this paper we have presented a GPU framework for ABM that
utilizes existing agent specification techniques to produce
efficient CUDA code. Unlike previous GPU alternatives [5, 19],
important agent birth and death functionalities have been
implemented and are guaranteed to succeed in linear time.
Agent communication through messages has been implemented
with efficient use of shared memory, and a resulting speedup over
the FLAME framework's original CPU implementation of over 80
times has been achieved.
In the future we expect to extend our work in the following ways.
Firstly it is highly desirable to consider differing communication
algorithms such as that demonstrated by Richmond & Romano
[19]. This will significantly reduce the number of message
interactions through the use of spatial partitioning. The
introduction of partitioning would not only improve single GPU
performance, but would aid in the production of a desirable
multiple GPU implementation. Secondly, visualization and real
time analysis tools will be implemented, which will make use of
simulation data readily available on the GPU. In addition to this
it is expected that the public release of this framework will result
in the implementation and evaluation of more advanced models
which will provide feedback for further optimisations.
[1] Barnard, J., Whitworth, J., and Woodward, M. 1996.
Communicating X-Machines. Journal of Information and
Software Technology, Vol 38. no. 6
[2] Buck, I., Foley, T., Horn, D., Sugerman, J., Fatahalian, K.,
Houston, M., and Hanrahan, P. 2004. Brook for GPUs:
Stream Computing on Graphics Hardware. Proceedings of
SIGGRAPH 2004, Los Angeles, California. August 8-12
[3] Chatelain, P., Cottet, G.H., and Koumoutsakos, P. 2007.
Particle Mesh Hydrodynamics for Astrophysics Simulations,
Int. J. Modern Physics C, 18, 4, 610-618
[4] Coakley, S., Smallwood, R., and Holcombe, M. 2006. Using
X-Machines as a Formal Basis for Describing Agents in
Agent-Based Modelling, Proceedings of the 2006 Spring
Simulation Multiconference, April 2006, pages 33-40.
[5] D'Souza, R. M., Lysenko, M., and Rahmani, K. 2007.
SugarScape on steroids: simulating over a million agents at
interactive rates. Proceedings of Agent2007 conference.
Chicago, IL
[6] Eilenberg, S. 1974. Automata, Languages and Machines.
volume A. Academic Press.
[7] Green, S. 2005. GPU-Accelerated Iterated Function
Systems, International Conference on Computer Graphics
and Interactive Techniques, ACM SIGGRAPH 2005
Sketches, Article 15.
[8] Green, S. 2007. CUDA Particles, NVIDIA Whitepaper,
November 2007.
[9] Harris, M., Coombe, G., Scheuermann, T., and Lastra, A.
2002. Physically-Based Visual Simulation on Graphics
Hardware. In Proceedings SIGGRAPH 2002 / Eurographics
Workshop on Graphics Hardware 2002.
[10] Harris, M., Sengupta, S., and Owens, J. 2007. Parallel
Prefix Sum (Scan) with CUDA. GPU Gems 3. Chapter 39.
[11] Howes, L. 2007. Loading Structured Data Efficiently With
CUDA. NVIDIA Technical Report.
[12] Jansen, T. 2007. GPU++: An Embedded GPU Development
System for General-Purpose Computations, PhD Thesis,
[13] Kolb, A., and Cuntz, N. 2005. Dynamic Particle Coupling
for GPU-based Fluid Simulation. Proc. 18th Symposium on
Simulation Technique, ISBN 3-936150-41-9, pages 722-727
[14] NVIDIA Corporation. 2007. CUDA Programming Guide
Version 2.0,
[15] NVIDIA Corporation. 2007. CUDA Quickstart Guide
Version 2.0,
[16] Nyland, L., Harris, M., and Prins, J. 2007. Fast N-Body
Simulation with CUDA, GPU Gems 3, Addison Wesley
Professional, Chapter 31
[17] Owens, J., Luebke, D., Govindaraju, N., Harris, M., Krüger,
J., Lefohn, A., and Purcell, T. 2005. A Survey of
General-Purpose Computation on Graphics Hardware. In
Proceedings of Eurographics 2005, State of the Art Reports,
pages 21-51
[18] Reynolds, C. 2006. Big fast crowds on PS3. In Proceedings
of the 2006 ACM SIGGRAPH Symposium on Videogames
(Boston, Massachusetts, July 30 - 31, 2006). sandbox '06.
ACM, New York, NY, pages 113-121
[19] Richmond, P., and Romano, D. 2008. Agent Based GPU, a
Real-time 3D Simulation and Interactive Visualisation
Framework for Massive Agent Based Modelling on the
GPU. Proceedings of International Workshop on
Supervisualisation 2008. Kos Island, Greece. June 2008. In
[20] Rouff, C., Hinchey, M., Truszkowski, W., and Rash, J.
2005. Verifying large number of cooperating adaptive
agents. 11th International Conference on Parallel and
Distributed Systems. June 2005.
[21] Wolfram, S. 2002. A New Kind of Science. Wolfram Media.