A DDR2 Controller for BEE3
Chuck Thacker, Microsoft Research
Issued 1 March 2007
Last Revised November 11, 2008
Corresponds to design V1.1
Introduction
The BEE3 system contains four Virtex-5 FPGAs. Each FPGA controls four DDR2 DIMMs, organized as
two independent 2-DIMM memory channels. The DDR2 controller described in this document implements
one memory channel.
Because we use DIMMs with capacities up to 4 GB, the controller is quite different from earlier controllers.
These DIMMs use x4 DRAM chips, with 36 chips per DIMM. We use RDIMMs, which reduce the
loading on the address and control lines at the expense of an extra cycle of CAS latency. The DIMM data
and strobe signals are treated as lanes, with each lane containing four bidirectional DQ (data) bits and one
bidirectional (although we use it as unidirectional) differential DQS (strobe) signal. A lane corresponds to
the signals from a particular pair of DRAMs on each DIMM. There are a total of eighteen lanes in six I/O
pin banks, with three lanes per bank. The controller is designed to operate Micron 4GB DIMMs, although it
has been tested with DIMMs from other vendors. Each DIMM consists of two ranks, for a total of four ranks
per controller.
The interface consists of three FIFOs that exchange address and control, read data and write data between
the controller and the rest of the FPGA. The interfaces to the user logic are asynchronous with respect to
the controller itself, and have their own clocks. The controller itself has two clock domains: MCLK,
which runs at 250 MHz, and CLK, which is half the rate of MCLK.
The FIFOs are implemented with block RAMs that also do ECC over the entire data path. The user logic
data width is 128 bits; the user logic supplies two 128 bit words for each write, and extracts two 128 bit
words for each read.
The address is 28 bits in length. This is the address of a 32 byte quantity, so the overall address space is
8GB. The address is accompanied by a single bit indicating whether the operation is a Read (1) or a Write
(0). Commands are processed in order, and read data is supplied in the order that the addresses were
supplied.
The controller works in concert with a small 36-bit RISC processor (TC5) that handles memory
initialization, data pin calibration, and refresh. This module must be instantiated in user designs. This
processor can also run a full-speed tester that will not appear in most designs.
To provide reliable data transfers between the RAMs and the controller, the delays of the input data pins
are adjusted to sample the data at the center of the valid window. This calibration procedure can be run at
any time, and is quite fast (~1000 MCLK cycles). Because the trace lengths of the BEE3 are carefully
matched, we don’t need to use DQS as a strobe for read data, since we know the cycle in which it will
appear after the command is sent to the RAMs. Note that the controller uses location 0 in rank 1 during
calibration, so user logic should not make use of this location.
The remainder of this document is organized as follows: First, the interface between the user logic and the
controller is discussed. The implementation of the controller hardware is then discussed. Finally, the TC5
that oversees the operation of the controller is described, as is the assembler used for TC5 programming
Copyright © Microsoft 2007, 2008
and the shell program used to communicate via RS232. A full speed RAM tester used for controller
debugging is also discussed briefly.
Controller Interface
The Verilog interface to the controller consists of the following signals:
module ddrController(
  input CLK, //Clocks
  input MCLK,
  input MCLK90,

  //User logic interface
  input [27:0] Address,
  input Read, //1 = Read, 0 = Write
  input WriteAF,
  output AFfull, //can't do anything if full
  input AFclock,

  output [127:0] ReadData,
  input ReadRB,
  output RBempty, //can't read if empty
  output RBfull,
  input RBclock,
  output SingleError,
  output DoubleError,

  input [127:0] WriteData,
  input WriteWB,
  output WBfull, //can't write if full
  input WBclock,

  //Signals to/from the DIMMs
  inout [71:0] DQ, //the 72 DQ pins
  inout [17:0] DQS, //the 18 DQS pins
  inout [17:0] DQS_L,
  output [1:0] DIMMCK, //differential clock to the DIMM
  output [1:0] DIMMCKL,
  output [13:0] A, //addresses to DIMMs
  output [2:0] BA, //bank address to DIMMs
  output [3:0] RS, //rank select
  output RAS,
  output CAS,
  output WE,
  output [1:0] ODT,
  output ClkEn, //common clock enable for both DIMMs. SSTL1_8
  output DIMMreset, //low true

  //Other Signals
  input StartDQCal0, //From TC5. Start the I/O bank-specific FSMs
  input [33:0] LastALU, //From TC5
  input injectTC5address, //From TC5
  input InhibitDDR, //From TC5.
  input Force, //From TC5. Forces calibration data pattern.
  input KillRank3, //From TC5. Causes a tester error
  input ResetDDR, //From TC5
  input Reset, //OR of reset (Global and FPGA-specific)
  output reg CalFailed, //To TC5. Calibration failed
  input DDRclockEnable //From TC5
);
The clocks are supplied by a PLL that is instantiated in the top-level module (ddrTop.v). This PLL is
driven by the 100 MHz system clock. The TC5 and the tester (if used) are also instantiated at the top level.
Read and Write are the only operations the user logic issues. The controller itself handles all other
operations required by the DRAMs.
To do a Read, the user logic loads an address accompanied by Read = 1 into the Address FIFO, which
triggers the control machine that does the read. The data is stored in two successive locations in the Read
Data FIFO. The FIFO may be read whenever the FIFO is nonempty. Reads may only be issued when
AFfull is false.
To do a write, the user logic loads the Write Data FIFO with two 128-bit words, and loads the Address
FIFO with the address accompanied by Read = 0. Writes may only be issued if AFfull and WBfull are both
false.
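The read and write handshakes described above can be illustrated with a small behavioral model. This is a Python sketch rather than the controller's Verilog: the class name ControllerModel, the FIFO depths, and the dictionary-backed memory are assumptions made for illustration. Only the flag names (AFfull, WBfull, RBempty), the Read bit encoding, and the two-word transfers come from the text.

```python
from collections import deque


class ControllerModel:
    """Behavioral stand-in for the three user-facing FIFOs (depths are assumed)."""

    def __init__(self, af_depth=16, wb_depth=32):
        self.af = deque()   # Address FIFO entries: (address, read_bit)
        self.wb = deque()   # Write Data FIFO: 128-bit words
        self.rb = deque()   # Read Data FIFO: 128-bit words
        self.af_depth = af_depth
        self.wb_depth = wb_depth
        self.mem = {}       # address -> (word0, word1), one 32-byte quantity

    @property
    def af_full(self):
        return len(self.af) >= self.af_depth

    @property
    def wb_full(self):
        # Model choice: "full" means there is no room for two more words.
        return len(self.wb) + 2 > self.wb_depth

    @property
    def rb_empty(self):
        return not self.rb

    def write(self, address, word0, word1):
        """Issue a write; allowed only if AFfull and WBfull are both false."""
        if self.af_full or self.wb_full:
            return False
        self.wb.extend([word0, word1])
        self.af.append((address, 0))    # Read = 0
        return True

    def read(self, address):
        """Issue a read; allowed only if AFfull is false."""
        if self.af_full:
            return False
        self.af.append((address, 1))    # Read = 1
        return True

    def step(self):
        """Process the oldest command, preserving issue order."""
        if not self.af:
            return
        address, is_read = self.af.popleft()
        if is_read:
            self.rb.extend(self.mem.get(address, (0, 0)))
        else:
            self.mem[address] = (self.wb.popleft(), self.wb.popleft())
```

A caller would issue write() and read(), call step() once per queued command, and then drain two words from rb whenever rb_empty is false.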
Controller Modules
The controller Verilog is hierarchically structured. The main module (ddrController.v) instantiates several
other modules: There are six instantiations of the logic associated with the I/O banks (ddrBank.v), the three
FIFOs (AF.v, RB.v, WB.v), and the logic that keeps track of open RAM banks (OBCam.v). Some of these
use a third level of “helper” modules.
The DDR controller must be instantiated by a top-level module that generates the clocks, instantiates the
TC5, and defines the signals that drive the DRAM DIMMs (and a few other signals) as top-level interface
signals. A UCF file is supplied to bind these signals to their locations on the BEE3 board.
Controller Implementation
The controller data and address path is shown in Figure 1:
[Figure 1 (block diagram not reproduced). Labeled elements: the CLK (MCLK/2) and MCLK domains; the
Address FIFO (1 BRAM) with its AFfull, WriteAF, and AFclock signals; the TC5 LastALU[33:0] input selected
by InjectTC5address; the Open Bank Logic (numOps); Timing Limit Enforcement (Stall); the Command Encoder;
the Normal and Alt command/address registers feeding the DIMMs; the Write Data FIFO (2 BRAMs), ECC Encoder,
and ODDR driving DQ[71:0]; and the ISERDES, Read Data FIFO (2 BRAMs), and ECC Decoder producing
ReadData[127:0], SingleError, and DoubleError.]
Figure 1: DDR Controller
The lower part of the figure shows the data path; the upper part shows the address and command paths.
Most of the data path is clocked by MCLK90, a clock whose phase lags MCLK by 90 degrees. The portion of
the address path that processes commands is clocked by CLK, and the addresses and commands are issued to
the DIMMs at MCLK rate. The reason for this organization is that the DIMMs want addresses and commands
that change on MCLK. The DIMMs are clocked by DIMMCK, which is generated in an ODDR (it is essentially
MCLK180). The DIMM therefore samples the commands and addresses at the center of each bit. The ISERDES
that recovers the data is clocked by ~MCLK. This gives the best valid window, while making transfers to
the MCLK90 domain meet timing easily.
The section of the control logic that processes commands from AF operates at half the MCLK rate since
this logic cannot complete in a single MCLK cycle. This does not limit performance, since read and write
commands require two MCLK cycles to transfer a full data burst. When a command arrives at the output
of the Address FIFO, the address bits are assigned to Column (9 bits), Bank (3 bits), Row (14 bits), and
Rank (2 bits), which together make up the 28-bit address. There is a stage of pipelining to allow the
calculation of numOps and the timing limit enforcement to take one CLK cycle each.
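The field split can be sketched in a few lines of Python. The field widths follow the text, with Row taken as 14 bits so that the fields sum to the 28-bit address and match the 14-bit A[13:0] bus. The bit ordering used here (column in the low bits, rank in the high bits) and the function name split_address are assumptions for illustration; the document does not state which address bits map to which field.

```python
COL_BITS, BANK_BITS, ROW_BITS, RANK_BITS = 9, 3, 14, 2
ADDRESS_BITS = COL_BITS + BANK_BITS + ROW_BITS + RANK_BITS   # 28
BYTES_PER_ADDRESS = 32                                       # one 32-byte quantity


def split_address(addr):
    """Split a 28-bit user address into DRAM fields (bit order is assumed)."""
    if not 0 <= addr < (1 << ADDRESS_BITS):
        raise ValueError("address must fit in 28 bits")
    col = addr & ((1 << COL_BITS) - 1)
    addr >>= COL_BITS
    bank = addr & ((1 << BANK_BITS) - 1)
    addr >>= BANK_BITS
    row = addr & ((1 << ROW_BITS) - 1)
    addr >>= ROW_BITS
    rank = addr & ((1 << RANK_BITS) - 1)
    return {"rank": rank, "row": row, "bank": bank, "col": col}


def address_space_bytes():
    """Total bytes reachable through the 28-bit address."""
    return (1 << ADDRESS_BITS) * BYTES_PER_ADDRESS
```

With these widths the address space works out to 2^28 x 32 bytes = 8 GB, in agreement with the figure given in the introduction.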
A given command can result in one, two, or three commands to the RAMs. The number of commands is
determined by the Open Bank Logic.
Open Bank Logic
The performance of DDR2 DRAMs is critically dependent on keeping several banks open in each rank of
DRAMs. When a bank is open, read or write operations in that bank can be started without delay. If a
bank is not open, an ACT (activate) command must be sent to the DRAMs to bring the data in the desired
row into on-chip registers, where it can be quickly accessed by read and write operations directed to a
column within that row. If a row other than the currently open one is accessed in an open bank, the bank
must be PRECHARGEd, and an ACT must be issued to the new row before doing the access. All banks in a
rank must be PRECHARGEd before doing a REFRESH. The Open Bank logic only services reads and writes –
refresh is handled by the TC5, as described later.
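[Figure 2 (block diagram not reproduced). Labeled elements: a 32 x 14 LUT RAM and a 32 x 1 valid bit array,
both addressed by {AFrank[1:0], AFbank[2:0]} and clocked by CLK; AFrow[13:0] as write data to the RAM and
1'b1 as write data to the valid array; the DoOp, DoReset, and Clr8 controls; and a 14-bit compare of
row[13:0] against the requested row that, together with the valid bit, produces Hit and Conflict.]
Figure 2: The open bank logic.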
The logic is shown in Figure 2. It consists of a 32 word x 14-bit LUT RAM that holds the row address for
each bank, plus one valid bit per entry (32 flip-flops total) that indicates that the rank/bank is open.
When a request arrives, the register for the requested rank/bank is accessed. The result is compared with
the requested row address. If they are equal (and the valid bit is set), the request may be issued
immediately. If the valid bit is clear, the bank must be activated. The valid bit is set, and the requested row
is loaded into the RAM location corresponding to the requested bank. If the valid bit is set but the
requested row does not match the currently open row, a Precharge must be issued to the requested bank,
and then an Activate must be issued to open the bank.
The open bank logic is in the CLK domain, since it requires more than one MCLK cycle for access. It
produces two bits (Hit and Conflict) that determine the number of RAM operations to issue for each
command, according to Table 1:
Hit   Conflict   Operations issued
0     0          2: Activate, Operation
1     0          1: Operation
0     1          3: Precharge, Activate, Operation
Table 1: Commands per user operation
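The lookup and Table 1 can be modeled behaviorally as follows. This is a Python sketch, not the OBCam.v implementation: the class and method names are hypothetical, and the rank/bank index is formed as {rank, bank} following the address shown in Figure 2.

```python
class OpenBankLogic:
    """Behavioral model of the open-bank table: 32 rows plus 32 valid bits."""

    BANKS_PER_RANK = 8

    def __init__(self):
        self.row = [0] * 32          # stands in for the 32 x 14 LUT RAM
        self.valid = [False] * 32    # one valid bit per rank/bank

    def lookup(self, rank, bank, row):
        """Return (hit, conflict, num_ops) for a request and update the table."""
        idx = (rank << 3) | bank
        hit = self.valid[idx] and self.row[idx] == row
        conflict = self.valid[idx] and self.row[idx] != row
        # After the request, the bank is open on the requested row.
        self.valid[idx] = True
        self.row[idx] = row
        if hit:
            num_ops = 1              # Operation
        elif conflict:
            num_ops = 3              # Precharge, Activate, Operation
        else:
            num_ops = 2              # Activate, Operation
        return hit, conflict, num_ops

    def refresh(self, rank):
        """A refresh closes every bank in the rank, so clear its valid bits."""
        for bank in range(self.BANKS_PER_RANK):
            self.valid[(rank << 3) | bank] = False
```

Repeated accesses to the same row of a bank therefore cost one command each, while alternating between two rows of the same bank costs three commands per access.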
The logic in the CLK domain can issue 1 or 2 commands each CLK cycle. These are referred to as the
“normal” and “alternate” command and address. The downstream logic (in the MCLK domain) sends the
normal and alternate command/address back-to-back (normal is first) during two MCLK cycles.
If there is only one command to issue, the sequence is Nop, Operation. If there are two, the sequence is
Activate, Operation.
If three commands are needed, two CLK cycles are used, controlled by the main FSM. This FSM has only
two states: Idle and Active. Normally, all commands are issued in Idle, but if three commands are needed,
the FSM transitions to Active. The first command issued is Precharge, Nop. In the
Active state, the commands (which will be issued after the Precharge-to-Active timing constraint is met)
are Activate, Operation.
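The pairing of commands into CLK cycles can be summarized in a short Python sketch. The function name command_slots and the string labels are hypothetical; each tuple lists the two commands sent back-to-back on successive MCLK cycles within one CLK cycle, in the order given in the text.

```python
def command_slots(num_ops, operation):
    """Return one (first, second) command pair per CLK cycle used."""
    if num_ops == 1:
        return [("Nop", operation)]
    if num_ops == 2:
        return [("Activate", operation)]
    if num_ops == 3:
        # Idle state issues the Precharge; the Active state issues the rest
        # once the Precharge-to-Activate timing constraint has been met.
        return [("Precharge", "Nop"), ("Activate", operation)]
    raise ValueError("num_ops must be 1, 2, or 3")
```

Only the three-command case occupies two CLK cycles, which is why the main FSM needs just the Idle and Active states.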
At any time, the TC5 can inject commands and addresses into the command pipeline. Before doing so, it
asserts InhibitDDR, which keeps the controller from executing any further user logic operations. It then
injects the command via the multiplexer shown in Figure 1. This is used for DRAM initialization (done at
power-up), calibration (which must be done before using the controller), and refresh. Refresh is done to
one rank every 2 µs, to meet the 8 µs refresh interval for a single rank while minimizing the peak power
drawn by the DIMMs.
When a Refresh is done, the TC5 issues PrechargeAll, Refresh for the current rank. The valid bits (in the
Open Bank logic) for all banks in that rank are cleared, since all banks are closed by Refresh.
Timing Limit Enforcement
The controller is responsible for issuing operations to the DIMMs and ensuring that the timing
requirements of the RAMs are met. DDR2 RAMs have a number of timing limitations that are set by their
specification and by the Mode Registers. Table 2 summarizes the various limits. It is based on the Micron
MT47H256M4 RAM, which comes in a number of speed grades. These RAMs are components in the
MT36HTF25672(P) RDIMM. The table shows the number of clocks for each parameter at speeds of 266
and 333 MHz. We require that if two DIMMs are used by a controller, they have identical characteristics.
In this document, and in the Verilog for the controller, we assume the values for the 266 MHz variant.
Table 2: DRAM Timing Limits

Parameter     ns [10]       Clocks @ 266   Clocks @ 333   Meaning
Cmd/Addr
CL [1]                      (3-6) 4        (3-6) 5        CAS Latency
AL [2,5]                    (0-4) 3        (0-4) 4        Posted CAS additive latency
BL [1]                      (4/8) 4        (4/8) 4        Burst length
tCCD [3]                    2              2              CAS-to-CAS delay
tRC [3]                     (14.2) 15      (18.3) 19      ACT-to-ACT (same bank) delay
tRRD [3]      7.5 ns        2              (2.5) 3        ACT-to-ACT (different bank)
tRCD [3]      15 ns         4              5              ACT-to-Read or Write
tFAW [3]      37.5 ns       10             (12.5) 13      4-bank ACT limit
tRAS [3]      40 ns         (10.6) 11      (13.3) 14      ACT-to-PRECHARGE
tWR [3]       15 ns         4              5              Write recovery (from burst end)
tWTR [4,9]    7.5 ns        10             13             Write-to-read
tRTW [4,6]                  4              4              Read-to-write
tRTP [3,8]    7.5 ns        5              7              Read-to-PRECHARGE
tWTP [7]                    12             15             Write-to-PRECHARGE
tRP [3]       15 ns         4              5              PRECHARGE period
tRFC [3]      127.5 ns      34             (42.5) 43      REFRESH-to-ACT
tRPA [3]      15 ns + tCK   5              6              PRECHARGE ALL period
tMRD [3]      -             2              2              LOAD MODE cycle time
Refresh
tREFI         7.8µs         2080           2600           Refresh interval
ODT
tAOND         -             2              2              On delay
tAOFD                       2.5            2.5            Off delay
tMOD                        12             12             ODT enable from MRS

Bracketed numbers after a parameter name refer to the notes that follow the table.
1: Set by MR.
2: Set by EMR 1.
3: Applies only to operations in the same rank.
4: Applies to operations both within the same rank and to operations in different ranks.
5: We set AL = tRCD - 1.
6: This value is calculated: (BL/2 + 2). See RAM spec Figure 28.
7: This value is calculated: (AL + CL - 1 + BL/2 + tWR (ns) / tCK). See RAM spec Figure 41.
8: This value is calculated: (AL + BL/2 + tRTP (ns) / tCK - 2). See RAM spec Figure 26.
9: This value is calculated: (AL + CL - 1 + BL/2 + tWTR (clocks)). See RAM spec Figure 40.
10: From RAM spec Table 48, unless calculated.
Note that the command/address register on the DIMM increases CL by 1 clock, so the time to valid data (R/W) also increases by 1.
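The clock counts in Table 2 can be cross-checked with a short Python sketch. It rounds each nanosecond limit up to whole clocks and applies the formulas from the notes. The clock periods used (3.75 ns for the 266 MHz grade and 3.0 ns for the 333 MHz grade) are the nominal values implied by the table's tREFI entries, and the names NS_LIMITS, clocks, and derived_limits are hypothetical. The write-to-read formula uses the tWTR clock count, which is what reproduces the table entries of 10 and 13.

```python
from fractions import Fraction
from math import ceil

# Nanosecond limits from Table 2 (strings keep the arithmetic exact).
NS_LIMITS = {
    "tRRD": "7.5", "tRCD": "15", "tFAW": "37.5", "tRAS": "40", "tWR": "15",
    "tWTR": "7.5", "tRTP": "7.5", "tRP": "15", "tRFC": "127.5",
}


def clocks(ns, tck_ns):
    """Round a limit in nanoseconds up to a whole number of clock periods."""
    return ceil(Fraction(ns) / Fraction(tck_ns))


def derived_limits(tck_ns, cl, bl=4):
    """Apply the formulas from the table notes for one speed grade."""
    t = {name: clocks(ns, tck_ns) for name, ns in NS_LIMITS.items()}
    al = t["tRCD"] - 1                                      # note 5
    return {
        "AL": al,
        "tRTW": bl // 2 + 2,                                # note 6
        "tWTP": al + cl - 1 + bl // 2 + t["tWR"],           # note 7
        "tRTP": al + bl // 2 + t["tRTP"] - 2,               # note 8
        "tWTR": al + cl - 1 + bl // 2 + t["tWTR"],          # note 9
        "tRPA": t["tRP"] + 1,                               # 15 ns + tCK
    }
```

Calling derived_limits("3.75", cl=4) and derived_limits("3", cl=5) gives the calculated entries for the two speed grades.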
Meeting the timing restrictions is tricky, since some limits (tWTR and tRTW) are used to prevent inter-rank
conflicts on the DQ bus, and others are needed to meet the internal requirements within the RAMs of a rank
for the operations within an open bank.
The tWTR and tRTW limits are to prevent bus contention, and are enforced by two counters (Tlr and Tlw)
that maintain the time (in cycles) since the last read or write.
We meet the tCCD, tRRD, and tMRD requirement trivially, since we can’t issue an OP more frequently than
every other cycle. The tRCD limit is met (and we can issue an OP immediately after an ACT) since we use
additive CAS latency with AL = tRCD -1.
tRP determines the PRECHARGE-to-ACT delay. Since in normal operation, a single-bank PRECHARGE is
always immediately followed by an ACT to the same bank, we need only to keep a global counter (Tlp), to
ensure the limit is met.
We meet tRC (the same-bank ACT-to-ACT delay) automatically by meeting tRAS followed by tRP.
During a REFRESH sequence, all banks in a given rank must be PRECHARGEd, and tRPA applies between the
PRECHARGE ALL and the REFRESH. This is enforced by the same global counter (Tlp) used for single-bank
PRECHARGEs.
Before a PRECHARGE is issued, tRAS (ACT-to-PRECHARGE), tRTP (READ-to-PRECHARGE) and tWTP
(WRITE-to-PRECHARGE) must all be met. For a single-bank PRECHARGE, these apply only to the bank being
PRECHARGEd, but for PRECHARGE ALL, they apply to the most recent operation for the entire rank. An
optimal solution would keep three counters for each potentially open bank (16). Instead, we keep four
counters per rank (Tact, Tread, Twrite, and Tref, a total of 16 instead of 48). These track the time
since the last ACT, READ, WRITE, or REFRESH to the rank. Meeting the limit for the most recently
issued operation of a given type means that it will be met for all earlier operations of the same type.
The relevant counter is loaded with the timing limit, and counts down on each CLK cycle. A separate bit
(TxxZ) is provided with each counter to indicate the zero state (this avoids having a test for zero in the
critical path). The timing limits for the requested rank in combination with the operation needed are
combined to form the signal Stall, which keeps any command from being issued in Idle:
assign Stall = (~TlwZ & ReadCommand) |           //Write to read
               (~TlrZ & ~ReadCommand) |          //Read to write
               (~TactZ[rankIn] & numOps[1]) |
               (~TwriteZ[rankIn] & numOps[2]) |
               (~TreadZ[rankIn] & numOps[2]) |
               ~TrefZ[rankIn];
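The countdown counters and the Stall expression can be mirrored in a small Python model. This is a sketch under stated assumptions: the class DownCounter and the helpers make_counters and stall are hypothetical names, the zero flag is computed rather than held in a separate flip-flop, and num_ops is passed as the raw numOps bit vector because the text does not give its encoding.

```python
class DownCounter:
    """Timing counter: loaded with a limit, counts down once per CLK cycle."""

    def __init__(self):
        self.count = 0

    @property
    def zero(self):
        # Stands in for the separate TxxZ bit kept by the hardware.
        return self.count == 0

    def load(self, limit):
        self.count = limit

    def tick(self):
        if self.count > 0:
            self.count -= 1


def make_counters(ranks=4):
    """Two global counters plus four counters per rank."""
    def per_rank():
        return [DownCounter() for _ in range(ranks)]
    return {"Tlw": DownCounter(), "Tlr": DownCounter(), "Tact": per_rank(),
            "Twrite": per_rank(), "Tread": per_rank(), "Tref": per_rank()}


def stall(c, rank, read_command, num_ops):
    """Term-by-term mirror of the Verilog Stall expression."""
    ops1 = bool(num_ops & 0b010)    # numOps[1]
    ops2 = bool(num_ops & 0b100)    # numOps[2]
    return bool(
        (not c["Tlw"].zero and read_command)            # write to read
        or (not c["Tlr"].zero and not read_command)     # read to write
        or (not c["Tact"][rank].zero and ops1)
        or (not c["Twrite"][rank].zero and ops2)
        or (not c["Tread"][rank].zero and ops2)
        or not c["Tref"][rank].zero
    )
```

A test bench would load the relevant counter when a command issues, call tick() on every counter each CLK cycle, and hold the next command while stall() returns True.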
The controller doesn’t enforce the tFAW limit, since we probably can’t run the controller fast enough for
this to be a problem.
Timing Chain
When a command is issued, the logic in each of the six I/O banks must be notified when data is to be sent
on a Write, or when data will arrive as a result of a Read. The signals that communicate between the
centralized control logic and the I/O banks are ReadBurst and WriteBurst.
Figure 3 shows the timing for writes; Figure 4 shows the timing for reads.
[Figure 3 (timing diagram not reproduced). Against MCLK and MCLK90, it traces a write from "cmdd1 valid"
through the write-data states wd0 to wd4. In ddrController it shows WriteBurst, WBd1, ReadWB, the FIFO
advance, and WD valid; in ddrBank it shows WBd1, WDpipe valid, WBd2, ReadWB, preDQSenL (low), DQSenL (in
dqs_iob), ChangeDQS, and DQSout for DQS and DQS_L. The RAM requirements are shown as: cmdd2 to DIMM, DIMM
samples cmdd2, RAM samples the DIMM-registered command, then AL + CL - 1 = 6 cycles to DQS, DQS_L, and
DQ beats 0 to 3.]
Figure 3: The write timing chain.
The total delay from the “nominal” time the DQ data is provided by the RAM and the time it is sampled by
~MCLK in the first rank of ISERDES flip-flops is 1.5 MCLK. This is made up of the RAM delay, the
trace delay, the input buffer delay, and delay added by the input IDELAY on each pin. The calibration
procedure (described below) adjusts the IDELAY on all pins to center the valid data on ~MCLK.
[Figure 4 (timing diagram not reproduced). Against MCLK and MCLK90, it traces a read: cmdd1 valid, cmdd2 to
DIMM, DIMM samples cmdd2, RAM samples the DIMM-registered command, then RL = AL + CL = 7 cycles to DQS,
DQS_L, and DQ beats 0 to 3 (nominal at RAM). DQdelayed follows after the RAM + Trace + InBuffer + IDELAY
delay. The ISERDES samples DQdelayed and advances, presenting beats 0 and 2 on the even RB input and beats
1 and 3 on the odd RB input. The read states rd0 to rd13 lead to RD valid, ReadBurst, WriteRB, and the RB
FIFO advance.]
Figure 4: The read timing chain.
Calibration
Most of the calibration logic is replicated six times, with one copy per I/O pin bank. Each of the six I/O pin
banks contains the data handling for three 4-bit lanes, corresponding to one RAM’s DQ pins and its
associated DQS pin. Logic at each ISERDES compares the received data with the calibration pattern, and a
small FSM at each bank controls the bank’s role in calibration.
When the TC5 determines that calibration is needed (exactly how is TBD), it Inhibits the controller, then
injects a Refresh to close all banks in the rank used for calibration, then injects an Activate, Write
command with Force = 1. This writes the calibration pattern into location 0 of rank 1, bank 0. It then
injects 64 Read/Nop commands into the controller. Finally, it waits for the banks to finish calibrating, and
checks for success or failure. On failure, the TC5 routine returns to caller + 1.
[Figure 5 (state diagram not reproduced). States: Idle, Read1, Read2, WaitEnd, FailX, DecWS, and DecWW.
Start leaves Idle; the transitions out of Read1 and Read2 are conditioned on WriteRBtime, DlyCnt, allGood,
and MinWin; WaitEnd is left when DlyCnt reaches 0; DecWS is left when WS == 0; DecWW returns to Idle when
WW == 0.]
Figure 5: The DQ calibration state machine.
The calibration sets the delay lines in each of the pins in each I/O bank. The goal of calibration is to adjust
the data delay so that the strobe that samples the data is in the center of the valid data window. The
IDELAYCTRL that sets the overall delay is clocked by MCLK, so each tap represents 1/64 of the MCLK
period (see footnote 1).
The DQ calibration machine is shown in Figure 5. In addition to the FSM, there are four counters:
DlyCnt, which counts the reads that have been received; WS (window start), which is used to detect the start
of the valid data window for the pins in the I/O bank; WW (window width), which determines the width of
the valid data window; and minWin, which is the minimum acceptable window width (5 taps).
The procedure works as follows: When the TC5 issues a Start, all IDELAY tap counters are reset, DlyCnt
is set to 63d, and WS and WW are zeroed. In state Read1 we are looking for the start of the valid window.
Logic at each ISERDES tests the data against the known calibration pattern, and produces allGood if all
lanes in the I/O bank match it. For each read that is received, WS and the IDELAY tap counters for all pins
are incremented. If the end of the delay line is reached before the window is found, the FSM transitions to
FailX, where it hangs until the controller is reset.
Footnote 1: This is out of spec. The Virtex 5 data sheet says this clock must be 200 +/- 10 MHz. But how
could anything in the silicon care much?
When the valid window starts, the FSM transitions to Read2, where it waits for the end of the window.
During Read2, WW is incremented for each allGood read received. The transitions out of Read2 state are
complex, but the basic idea is that if the window ends before the end of the delay line is reached, and the
minimum window width has been achieved (minWin == 0), the FSM transitions to WaitEnd, where it waits
for the group of 64 reads to complete (without incrementing WS or WW). On the other hand, if the delay
line ends before the minimum acceptable window is found (minWin != 0), the FSM transitions to FailX. If
the delay line ends in Read2 but the window width is acceptable, the FSM transitions directly to DecWS.
When DecWS is entered, the IDELAYs are again reset. In DecWS, WS is decremented to 0 while the
IDELAY tap count is incremented. When WS goes to zero, the FSM enters DecWW. In DecWW, WW is
decremented on every clock, but the IDELAY tap counter is only incremented on every other clock. When
WW becomes zero, the FSM returns to Idle. The effect of incrementing the tap counter on every other
clock in DecWW is to center the data in the valid window, which is the desired goal.
The TC5 waits for the I/O banks to finish setting the correct tap counts (it waits a conservative delay), then
checks if any I/O bank has failed (i.e., is in state FailX). If so, it reports failure.
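The net effect of the calibration procedure can be sketched in Python. This is a simplified model, not the per-bank FSM: it takes a list of 64 pass/fail results (one per IDELAY tap, standing in for allGood on successive reads) and returns the final tap setting, or None for the FailX case. The function name calibrate, the convention that WW counts every passing tap, and the rounding of WW/2 downward are assumptions.

```python
def calibrate(good, min_win=5):
    """Return the centered IDELAY tap, or None if calibration fails.

    good[t] is True when every lane in the bank matches the calibration
    pattern with the IDELAYs set to tap t.
    """
    taps = len(good)

    # Read1: advance the tap (and WS) until the valid window starts.
    ws = 0
    while ws < taps and not good[ws]:
        ws += 1
    if ws == taps:
        return None                 # FailX: delay line ended, no window

    # Read2: measure the window width WW.
    ww = 0
    while ws + ww < taps and good[ws + ww]:
        ww += 1
    if ww < min_win:
        return None                 # FailX: window narrower than minWin

    # DecWS steps the tap back up to the window start; DecWW then adds one
    # tap for every two counts of WW, which lands on the window center.
    return ws + ww // 2
```

For example, a window that opens at tap 20 and spans 12 taps yields a final setting of tap 26.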
TC5
TC5 (“Tiny Computer v5”) is responsible for initializing and refreshing the DDR2 RAMs. It also starts
and monitors the (optional) RAM tester. TC5 must be instantiated with the DDR
controller, but it is quite small and using it simplifies the controller substantially.
Figure 6 is a block diagram of the TC5.
[Figure 6 (block diagram not reproduced). Labeled elements: the 512 x 36 register file with write port
(WD[35:00], Waddr) and read ports A and B (Aaddr, Baddr; Aout, Bout); the Add, Sub, Logic unit followed by
the Cycle shifter, producing ALU[35:00]; the InData[35:00] input; the Skip Test logic (Z, N, InRdy,
doSkip); the PC with its +1 and +2 incrementers and the PCmux (PCsel, Jump); LastALU[8:0] as the indexed
register address selected when IM[1:0] = 1 (read) or IM[1:0] = 2 (write); and the 1K x 36 Instruction
Memory producing IM[35:0]. Clocks are derived from Ph0 and ~Ph0.]
Figure 6: TC5
Instruction Format
The instruction set (Figure 7) is very simple and regular. All instructions have the same format, including a
destination register Rw and two source registers Ra and Rb. Most logical and arithmetic instructions are of
the form Rw <= Shift (Function (Ra, Rb)), with a conditional skip on the condition specified by the Skip
field.
 35      28 27      19 18      10 9        7 6     5 4     2 1    0
+----------+----------+----------+----------+-------+-------+------+
|    Rw    |    Ra    |    Rb    | Function | Shift | Skip  |  Op  |
+----------+----------+----------+----------+-------+-------+------+

Function:          Shift:           Skip:
0: A + B           0: No shift      0: Never
1: A - B           1: RCY 1         1: ALU < 0
2: A & B           2: RCY 9         2: ALU = 0
3: A & InData      3: RCY 18        3: InRdy
4: A | B                            4: Always
5: A | ~B                           5: ALU >= 0
6: A ^ B                            6: ALU # 0
7: A & ~B                           7: ~InRdy

Op:
0: ALUop:          RF[Rw] <= Shift(F(RF[Ra], RF[Rb])), Skip if condition
1: ReadRFIndexed:  RF[Rw] <= Shift(F(RF[LastALU], RF[Rb])), Skip if condition
2: WriteRFIndexed: RF[LastALU] <= Shift(F(RF[Ra], RF[Rb])), Skip if condition
3: Jump:           RF[Rw] <- PC + 1, PC <- F(RF[Ra], RF[Rb])
Figure 7: TC5 instruction format
Note that Ra and Rb are 9 bits, but Rw is only 8 bits wide. Read-only constants are placed in the upper 256
locations of RF. Note that these registers can be written using indexing, but this is a bit cumbersome.
Indexed addressing of the read ports of RF is done by using the value of the ALU produced by the
preceding instruction as a register address. Note that LastALU is not loaded if Op = Jump, which allows
the target of a jump to do an indexed read.
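The instruction fields and the ALU data path can be illustrated with a short Python sketch. The field positions follow the bit boundaries in Figure 7 (Rw 35:28, Ra 27:19, Rb 18:10, Function 9:7, Shift 6:5, Skip 4:2, Op 1:0). The function names decode, rcy, and alu are hypothetical, and results are truncated to 36 bits as an assumption about how the hardware treats carries and borrows.

```python
MASK36 = (1 << 36) - 1
SHIFT_AMOUNTS = (0, 1, 9, 18)       # No shift, RCY 1, RCY 9, RCY 18


def decode(word):
    """Split a 36-bit TC5 instruction word into its fields."""
    return {
        "rw": (word >> 28) & 0xFF,
        "ra": (word >> 19) & 0x1FF,
        "rb": (word >> 10) & 0x1FF,
        "function": (word >> 7) & 0x7,
        "shift": (word >> 5) & 0x3,
        "skip": (word >> 2) & 0x7,
        "op": word & 0x3,
    }


def rcy(value, n):
    """36-bit right cyclic shift by n bits."""
    value &= MASK36
    return ((value >> n) | (value << (36 - n))) & MASK36


def alu(function, shift, a, b, in_data=0):
    """Compute Shift(Function(A, B)); Function 3 uses InData in place of B."""
    results = (
        a + b, a - b, a & b, a & in_data,
        a | b, a | ~b, a ^ b, a & ~b,
    )
    return rcy(results[function] & MASK36, SHIFT_AMOUNTS[shift])
```

Because the shifter follows the add/sub/logic unit, a single instruction can, for instance, add two registers and rotate the sum right by 9 bits.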
The TC5 includes an “event timer” mechanism that allows periodic tasks to share the processor. This
makes it possible to do things like refreshing the memory periodically while sending characters to the
RS232 at the same time. Event timers aren’t as complex as true interrupts, but they serve the same purpose,
without the need for a stack.
Note that the instruction memory (IM) is read-only. This is not a problem, since programs are patched into
the FPGA bitstream. A simple assembler is provided to assemble TC5 programs, which are then added to
the .bit file produced by the ISE using the Xilinx data2mem tool.
Instruction Execution
TC5 instructions are executed in one cycle of the Ph0 clock, which is MCLK/4. This clock has a duty
cycle of 0.375. Most registers are loaded on the rise of Ph0, but the RF addresses are loaded at the fall of
Ph0 (after the instruction itself has been fetched). The timing is completely deterministic.
In general, an instruction executes as follows: The contents of register Ra are read from the register file
and used as argument A to the ALU. The contents of register Rb are read from the register file and used as
argument B to the ALU. The B input of the ALU is taken from the InData bus if the “&&” Function
(Function 3) is executed. The ALU computes a result by combining A and B according to the Function
field followed by a shift according to the Shift field. The result of the ALU is then used in various ways,
typically by writing it into register Rw in the register file.
The Function field controls the add/sub/logic unit, which combines arguments A and B to produce a 36 bit
result.
The Shift field controls the shifter, which takes the 36 bit add/sub/logic result and applies a right cyclic
shift of 0, 1, 9, or 18 bits.
The Op field controls major effects of the instruction by, among other things, selecting the actual value to
be written into register Rw in the register file.
An ALUop selects the result of the ALU to write into Rw and enables the skip logic to skip the following
instruction if the skip condition is true. The skip is done by selecting PC+2 rather than PC+1 as the new
value of the PC. Skipping the following instruction is the only provision for conditional execution apart
from a computed jump.
A ReadRFIndexed Op addresses the “a” port of RF with the low 9 bits of the ALU value of the previous
instruction (Ra is ignored). Otherwise, it is like a normal ALU Op.
A WriteRFIndexed Op addresses the “w” port of the RF with the low 9 bits of the ALU value of the previous
instruction. Note that Rw is only 8 bits (it is zero extended), while the LastALU value is 9 bits. This
provides a way to write the upper 256 registers if needed.
A Jump Op selects PC+1 to write into Rw and selects the result of the ALU as the new value of the PC.
Skips are suppressed. Software can interpret the value stored in Rw as a return address (a subroutine call)
or it can ignore it (a Jump). There is no hardware-supported call stack. Since the output of the ALU is
selected as the new PC, all jumps are effectively computed jumps, but we expect that most often a jump
will arrange to get its target address from a preloaded constant register.
There are relatively few ALU functions. The shifter is placed after the add/sub/logic unit. It too has
limited capabilities.
Register overloads
Some of the RF addresses are “overloaded”, in that accessing the register has a side effect. Writing to
RF[255], for example, is used to send the ALU contents to an external device. RF[255] is also written.
RF[254] is used by the event timers (see below).
RF[253] is used to write the ALU value into the DDR controller address/command pipeline. This is used
during memory initialization, calibration, and refresh.
Similarly, one of the Function code points is overloaded. The “&&” Function replaces RFb with
InData[35:00] as the ALU “b” input.
Input and Output
The InData bus is used to read external data. The TC5 doesn’t need to read from many sources. Currently,
the InData signals are only used to read a character from the RS232 receiver, and to read from the Event
triggers (see below) and the XD bus. XD is a byte-wide bus used by the tester to send the low byte of the
192-bit XD register in the RAM tester to the TC5. When the tester is queried (see below), if an error has
occurred, the XD register contains the data read from memory (128 bits), the expected data (32
bits), and the address and error flags (32 bits). After an error, XD is read and right-shifted 24 times to read
the error data into the TC5, which prints it to the console.
The Output register overload is used primarily to set and clear individual control bits used by other parts of
the system. The control bits are loaded when R[255] is written. Since many of the control bits are
unrelated, it is convenient to use R[255] as a “shadow register” that contains the current state of all the bits.
If we call register R[255] “Output”, then to set output bit X, execute:
Output <= aOutput | bBitX;
to clear bit X, execute:
Output <= aOutput &~ bBitX;
Register BitX contains the mask of the bit(s) to be set or cleared. This changes both the shadow register
and the output bits. Note that these statements make use of both the Rw overload (Output) and an overload
on Ra (aOutput). The Ra overload is supplied as a “field” definition in the assembler prologue that defines
the machine.
The Output bits that are used to operate the DDR controller (and the Tester, if it is present), are as follows:
ALU [00]: StartDQCal. This bit is toggled during calibration to start the calibration state machines.
ALU [01]: InhibitDDR. This bit is used during refresh to keep the controller from accepting user requests.
ALU [02]: DDRclockEnable. This bit enables the DDR DIMM clock.
ALU [03]: ResetDDR. This bit is toggled to reset the DDR controller.
ALU [04]: Start. This bit is toggled to start the Tester.
ALU[05]: SDA for I2C interface.
ALU[6]: SCL for I2C interface.
ALU [07]: Force. This bit is asserted during calibration to force the calibration data pattern during the
write done at the beginning of calibration.
ALU [09]: TxDn. This bit is the RS232 “Transmit Data” bit. RS232 transmission is done by toggling this
bit at the appropriate times to generate characters.
ALU [10]: KillRank3. This bit is set to keep the controller from issuing “Select” to DIMM rank 3. It is
used to force an error condition during testing.
ALU [11]: Select. Toggling this signal switches the RS232 and SDA/SCL signals between the two TC5s in
the dual controller.
ALU [28:27]: TestConf. These bits select the tester address (bit 27) and data (bit 28) generation
algorithms. If “0”, addresses/data are sequential. If “1”, they are taken from linear feedback shift registers.
This is used during testing.
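When debugging, it can help to decode an Output value into the names listed above. The Python sketch below does that; the function name describe_output and the dictionary keys are hypothetical, while the bit positions are taken from the list.

```python
# Bit positions of the single-bit Output controls, from the list above.
OUTPUT_BITS = {
    0: "StartDQCal",
    1: "InhibitDDR",
    2: "DDRclockEnable",
    3: "ResetDDR",
    4: "Start",
    5: "SDA",
    6: "SCL",
    7: "Force",
    9: "TxDn",
    10: "KillRank3",
    11: "Select",
}


def describe_output(value):
    """Return the names of the set control bits and the TestConf selections."""
    names = [name for bit, name in sorted(OUTPUT_BITS.items())
             if (value >> bit) & 1]
    return {
        "set": names,
        "lfsr_address": bool((value >> 27) & 1),  # TestConf bit 27
        "lfsr_data": bool((value >> 28) & 1),     # TestConf bit 28
    }
```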
Event Triggers
The TC5 doesn’t have interrupts, but it has something similar – Event Triggers. Triggers are of two types:
Timed triggers and simple event triggers. Timed triggers are used to generate periodic events such as the
transmission of an RS232 character one bit at a time, or refreshing the DDR2 memory. Simple triggers just
respond to single events, such as the receipt of an RS232 character. The TC5 currently has three event
triggers: two timed triggers, for refresh and RS232 transmission, and one simple trigger, for RS232
reception.
The triggers are implemented as a set of hardware registers. Each trigger consists of an “armed” bit, and a
“trigger” bit. A timed trigger also contains a register which is compared with a continuously-running 10-bit
counter (Now). When the counter value becomes equal to the register, the trigger bit is set if the trigger is
armed.
The motivation for adding triggers was that the earlier TC4 spent a lot of time in delay and polling loops.
While in such a loop, the TC4 couldn’t do anything else. Event triggers remove this limitation, and allow
several tasks to run quasi-concurrently. For the timed triggers, the timer value can be loaded with the sum
of the current value and the desired time until the next trigger. This means that periodic events will occur
with the correct period, subject only to jitter if some other service routine is running when the trigger
occurs. Since service routines are short, this hasn’t been a problem in practice.
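The reason reloading from the old timer value (rather than from Now) keeps the period exact can be seen in a small simulation. This Python sketch is an illustration only: the function names are hypothetical, and it assumes every service delay is shorter than the period and the period is shorter than the 1024-tick counter range.

```python
COUNTER_BITS = 10
COUNTER_MASK = (1 << COUNTER_BITS) - 1  # Now is a 10-bit counter


def next_deadline(current_deadline, period):
    """Reload value: previous deadline plus period, wrapped to 10 bits."""
    return (current_deadline + period) & COUNTER_MASK


def fire_times(period, delays, start=0):
    """Return the absolute tick at which each trigger fires.

    delays[i] is how many ticks late the i-th service routine runs (jitter).
    """
    timer = (start + period) & COUNTER_MASK
    now = start
    fired = []
    for delay in delays:
        # Advance until the free-running counter equals the timer register.
        while (now & COUNTER_MASK) != timer:
            now += 1
        fired.append(now)
        now += delay                          # service routine starts late
        timer = next_deadline(timer, period)  # reload from the old deadline
    return fired
```

With a period of 100 and assorted delays, the triggers still fire at ticks 100, 200, 300, and so on, including across the counter wrap at 1024; only the service routine itself is late, not the next event.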
The main loop of the TC5 shell loops on the “OR” of the trigger bits. When a trigger event occurs (either a
single event or because Now becomes equal to one of the timer registers) and that trigger is armed, the
associated trigger bit is set. The hardware generates a binary encoding of the highest priority pending
trigger, and returns it on InData[11:9]. When the program executes the “ski” or “skni” skip operation, the
highest priority pending trigger is cleared. The main loop does an indexed load of the PC of the service
routine for the trigger (stored in a table in RF), and jumps to it.
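The dispatch step can be sketched as a priority encoder followed by a table lookup. In this Python model the function names are hypothetical, and the priority order (lowest-numbered trigger first) is an assumption, since the ordering is not stated here.

```python
def highest_priority(pending):
    """Index of the highest-priority pending trigger, or None if none.

    Assumption: the lowest-numbered set bit has the highest priority.
    """
    if pending == 0:
        return None
    return (pending & -pending).bit_length() - 1


def main_loop_step(pending, service_table):
    """One pass of the main loop: pick a trigger, clear it, look up its routine.

    Returns the updated pending mask and the selected table entry (or None).
    """
    index = highest_priority(pending)
    if index is None:
        return pending, None
    pending &= ~(1 << index)  # the skip operation clears the trigger bit
    return pending, service_table[index]
```

For example, with triggers 1 and 2 pending and a table of three routine names, the first step selects entry 1 and leaves trigger 2 pending for the next pass.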
Triggers remain armed until explicitly disarmed, but in order to generate a periodic signal (e.g., an RS232
character), when the timer event occurs, the associated timer register must be updated to cause an event at
the original time plus the bit time.
To get the event time, the shell reads InData [9:0]. The contents of this field depend on the value of the
previous instruction’s ALU[1:0] field. If it is 0 or 1, Timer [LastALU [0]] is returned. If it is >= 2, the
value of Now is returned (this makes it possible to get the first timer period right during character
transmission).
To reload a timer, the WriteTimer Rw overload is used (Rw = 254; the prologue below defines it under the name WriteTrig). The timer to load is selected by the
previous instruction’s ALU field, as with reading. This value also determines whether the timer should be
armed or disarmed. If LastALU [2] = 0, the timer is disarmed. If it is 1, the timer is (or remains) armed.
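The selection rules for reading and writing the timers can be summarized in a short Python model. The function names are hypothetical. The sketch assumes the timer registers are 10 bits wide (to match Now) and, since the behavior is not described, rejects a write whose selector is 2 or 3.

```python
TIMER_MASK = (1 << 10) - 1  # assumed width, matching the 10-bit Now counter


def read_event_time(last_alu, timers, now):
    """Value returned on InData[9:0], chosen by the previous ALU[1:0]."""
    selector = last_alu & 0b11
    if selector < 2:
        return timers[selector & 1]  # Timer[LastALU[0]]
    return now                       # selector >= 2 returns Now


def write_timer(last_alu, value, timers, armed):
    """Model a WriteTimer: load the selected timer and arm or disarm it."""
    selector = last_alu & 0b11
    if selector >= 2:
        raise ValueError("selector 2 or 3 on write is not modelled")
    timers[selector] = value & TIMER_MASK
    armed[selector] = bool(last_alu & 0b100)  # LastALU[2]: 1 = armed
```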
The TinyAsm assembler
The assembler used to assemble for the TC5 is extremely simple. It only knows how to look things up in a
symbol table, and generate instructions. Instructions are written one per source line, and are terminated
with “;”. The assembler can be used to assemble for different variants of the hardware (all 36 bits in width)
by changing a prologue that appears at the beginning of the source file. Here is the prologue and the first
few instructions for the current version of the TC5 test program (TestTC5.txt):
field aoff 19 0;                        Define the field offsets for rfref.
field boff 10 0;
field woff 28 0;
field instruction 0 0;                  Symbolic name for instruction memory
field rf 1 0;                           Symbolic name for register file
field PC 0 0;                           Noise word
field <= 0 0;                           Noise word
field , 0 0;                            Noise word
field InData 0 0;                       Noise word
field + 0 7;                            The "plus" function
field - 1 7;                            The "minus" function
field & 2 7;                            The "and" function
field && 3 7;                           The "and InData" function
field | 4 7;                            The "or" function
field |~ 5 7;                           The "or not" (nor) function
field ^ 6 7;                            The "xor" function
field &~ 7 7;                           The "and not" function
field rcy1 1 5;
field rcy9 2 5;
field rcy18 3 5;
field skn 1 2;                          Skip if ALU < 0
field skz 2 2;                          Skip if ALU = 0
field ski 3 2;                          Skip if InRdy
field skp 4 2;                          Skip always
field skge 5 2;                         Skip if ALU >= 0
field sknz 6 2;                         Skip if ALU # 0
field skni 7 2;                         Skip if ~InRdy
field ReadX 1 0;                        Read RF, addressed by LastALU
field WriteX 2 0;                       Write RF, addressed by LastALU
field Jump 3 0;                         LastALU is unchanged by Jump
;Rw overloads
field Output 255 28;
field aOutput 255 19;
field WriteTrig 254 28;
field DDRaddr 253 28;
;--------------End of machine description--------------------------
mem instruction loc 1;                  Set current memory to the IM, location 1.
; Integrated initializer and shell for the BEE3.
; Initialize the RAMs
Output aOutput | bTwo;                  Inhibit DDR
Output aOutput | bEight;                Toggle ResetDDR
Output aOutput &~ bEight;
wDelay <= aPwrDly + bZero;              Wait 200 us.
Jump aDly + bZero, wRlink <= PC;
Output aOutput &~ bTwo;                 Enable DDR
Output aOutput | bFour;                 DIMM clock on
Jump aInitMem + bZero, wRlink4 <= PC;   Initialize memory
The assembler is line-oriented. A scanner breaks each line into a series of tokens, which are numbers,
reserved words, or strings. These are stored in the token array, which is then processed. A semicolon
causes the remaining characters in the line to be skipped, and positions the scanner at the start of the next
line after processing any tokens.
The token processor skips tokens that are undefined. Isolated numbers or strings that resolve to fields are
placed into the “Current value” (cv) word with the indicated bit offset. An isolated number uses an offset
of 0.
There are relatively few built-in reserved words:
1.) field declares a field. The next token is a string, and the two after it are numbers. A symbol is
defined with key = string, value = first number, offset = second number. Token processing continues at the
third token after the field.
2.) “mem String (which must evaluate to a number)” or “mem number” makes M[number] the current
memory. Usually the mem String form will be used, after a preceding field definition of the memory name.
Memories are numbered 0 to 2. Token processing continues after the string or number.
3.) “loc String (which must evaluate to a number)” or “loc number” sets the current location in the current
memory.
4.) “String:” Provides a way to label statements. Three symbols are created: One is “aString”, one is
“bString”, and the third is “wString”. These symbols have the same value – the index of the symbol in the
current memory, but have offsets corresponding to the “a”, “b”, and “w” fields of an instruction.
By convention, locations in IM (or RF) are labeled with initial small letters (e.g., “calibrate”), while a
constant in the register file is labeled with a capitalized first letter, but the rest identical:
Calibrate: calibrate;
To call “calibrate”, storing the return PC in Rlink4, do:
Jump aCalibrate + bZero, wRlink4 <= PC;
5.) “String noise” defines a noise word. These are ignored, but may be added to make the source clearer.
When token processing is finished, if any field in cv has been set, the value is placed into the current
location in the current memory, and the current location is incremented. cv is cleared and scanning resumes
at the start of the next line.
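The token processing described above amounts to OR-ing shifted field values into cv. The following Python sketch shows the idea for a single line; it is not the C# assembler. The function name process_line is hypothetical, tokens are split on whitespace (the real scanner may differ), and only the field reserved word is handled.

```python
def process_line(line, symbols):
    """Assemble one source line into a current-value (cv) word.

    symbols maps name -> (value, offset) and is updated by "field".
    Returns cv, or None if the line set no field.
    """
    # Characters after ';' are skipped; treat ',' as its own token.
    code = line.split(";", 1)[0].replace(",", " , ")
    tokens = code.split()
    cv, touched, i = 0, False, 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "field" and i + 3 < len(tokens):
            # field <string> <value> <offset>
            symbols[tokens[i + 1]] = (int(tokens[i + 2]), int(tokens[i + 3]))
            i += 4
            continue
        if tok.isdigit():
            cv |= int(tok)            # isolated number, offset 0
            touched = True
        elif tok in symbols:
            value, offset = symbols[tok]
            cv |= value << offset     # place the field at its bit offset
            touched = True
        # undefined tokens are skipped
        i += 1
    return cv if touched else None
```

Given field definitions for Output (255 at offset 28), aOutput (255 at offset 19), | (4 at offset 7), and a stand-in definition of bTwo as value 2 at offset 10 (in the real program bTwo comes from a label, and its value here is invented), the line "Output aOutput | bTwo;" assembles to the OR of those four shifted values.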
The assembler is two-pass, to provide for the resolution of forward references. During the first pass, the
location counter(s) are modified as above, but no code is placed in the memories. During the second pass,
all symbols should have been resolved, and memory contents are generated. When processing of the input
is complete, the three memories are written to the files needed for the FPGA configuration bitstream, and
the program exits.
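Forward references are the reason for the second pass. The Python sketch below illustrates this with labels only; the function name two_pass and the Nop symbol are hypothetical, a single memory is modelled, and an unresolved symbol raises an exception in place of the "undefined symbol" listing entry.

```python
A_OFF, B_OFF, W_OFF = 19, 10, 28  # aoff, boff, woff from the prologue


def two_pass(lines, symbols=None, start=0):
    """Assemble lines twice: pass 1 defines labels, pass 2 emits words."""
    symbols = dict(symbols or {})
    memory = {}
    for pass_no in (1, 2):
        loc = start
        for line in lines:
            tokens = line.split(";", 1)[0].split()
            if tokens and tokens[0].endswith(":"):
                # "String:" creates aString, bString and wString.
                name = tokens.pop(0)[:-1]
                for prefix, off in (("a", A_OFF), ("b", B_OFF), ("w", W_OFF)):
                    symbols[prefix + name] = (loc, off)
            if not tokens:
                continue
            if pass_no == 2:
                word = 0
                for tok in tokens:
                    if tok not in symbols:
                        raise KeyError("undefined symbol: " + tok)
                    value, off = symbols[tok]
                    word |= value << off
                memory[loc] = word
            loc += 1  # the location counter advances in both passes
    return memory, symbols
```

In a three-line program whose first instruction jumps to a label defined on the third line, pass 1 records the label's location, so pass 2 can fill in the jump target even though it appears later in the source.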
The assembler also writes a listing file to the file “listing.txt” in the working directory. It is very important
to inspect this file after making source changes. The listing is annotated with program counter values for
the instruction memory, which are needed for debugging. It also summarizes (at the end) the number of
locations used in the memories (RF and IM), and the number of errors encountered. Errors are flagged in
the listing file at the point the error was detected. The most frequent error is “undefined symbol”, which
usually indicates a typo. There are a few other errors that are detected, but it is quite easy to write
programs that assemble without errors but which don’t work. Since the number of programs that will be
written for the TC5 is small (probably just one), this is considered acceptable.
The assembler can be run from a DOS command line, or from the Visual Studio debugger. I prefer the
latter method, since in the (rare) case that the program throws an exception, it is possible to figure out why.
Since the entire program is only about 250 lines of C#, it should be easy to understand.
The “TestTC5” shell
TestTC5 is a simple debugger for TC5. It also contains routines to control the memory tester, reset and
calibrate the DIMMs, and refresh the memory. It communicates with the (human) user over a 115,200
baud serial link of 8 data bits, 1 stop bit, no parity, and no flow control. The BEE3 control board brings out
the serial link as a USB port, so the host must have the appropriate driver. HyperTerminal is a useful utility
to run the shell.
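For reference, the standard framing of one character on such a link (8 data bits, no parity, 1 stop bit) can be written down directly. This Python sketch gives logical line levels only; the function names are hypothetical, and the electrical polarity of the TxDn bit is not addressed.

```python
BAUD = 115_200


def frame_8n1(byte):
    """Logical bit sequence for one character: start, 8 data bits, stop.

    The start bit is 0, data bits are sent least significant first,
    and the stop bit is 1.
    """
    data_bits = [(byte >> i) & 1 for i in range(8)]
    return [0] + data_bits + [1]


def bit_time_us():
    """Duration of one bit in microseconds at the configured baud rate."""
    return 1e6 / BAUD
```

Each character therefore occupies ten bit times of about 8.68 us each, which is the interval a timed trigger has to reproduce when sending a character one bit at a time.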
When the BEE3 is reset or the FPGA is reconfigured, the TC5 starts execution at location 1. It resets the
system, waits for 200 us as required by the RAMs, initializes the RAMs on the four DIMMs, and arms the
event timers.
It then reads several ASCII strings from the EEPROM on the FPGA’s real time clock and prints them. These
include the model (“BEE3-LX155T-2C”), the serial number, the Ethernet MAC address, and the FPGA
identifier (“A”, “B”, “C”, or “D”).
It then attempts to calibrate the DDR2 DIMMs. If this succeeds, it prints:
Cal Done
Bee3 Shell
>
If calibration fails, it prints:
Cal Fail
Bee3 Shell
>
This is dire, and requires debugging, probably with ChipScope.
Once the shell is started, it listens for input. Decimal digits are assembled into the “current value” as they
are typed. Other characters recognized by the shell are:
“/”: This “opens” the location in the register file given by the current value, and prints the contents. Typing
“/” again uses the contents as an address, opens that cell, and prints the contents (not very useful).
“>”: Opens the next location in the register file and prints its contents.
“<”: Opens the previous location in the register file and prints its contents.
“Enter”: This “closes” the currently open cell by storing the current value into it. To read the contents of
location 256 (say it holds 3), change it to 20, and store it, the exchange is:
>256/00000000003 20 (enter)
>
“r”: This complements the radix used for printing. The shell echoes the new radix (“x” or “d”), and
numbers will then be printed in the new radix. Whenever the shell prints its prompt (“>”), the radix is reset
to decimal.
“g”: This jumps to the location given by the current value, saving a return PC in Rlink4.
The called routine can indicate “success” by doing a normal return, or it can indicate failure by returning to
aRlink4 + bOne (i.e., by skipping). The shell prints “s” or “f”, prints “>” and returns to listening for input.
So typing:
>1g
resets and reinitializes the system, and restarts the shell.
“s”: This toggles the Select signal, which causes the top level module to switch the RS232 and SDA/SCL
signals between the two TC5s in the dual controller. It is not echoed.
The shell simply skips over unrecognized characters without echoing them. There is no backspace facility,
but typing “/” “enter” if an input value is mistyped is usually safe. If you jump to a location above the end
of occupied instruction memory, the shell will restart, since the TC5 will happily increment the PC and roll
over to the initialization sequence. If you jump into a random location in occupied IM, anything can
happen.
Ram Tester
The top level Verilog module instantiates a tester for the RAM. User designs need not instantiate this tester,
but it is useful during hardware debugging.
The tester generates a stream of commands consisting of burstLength writes followed by burstLength
reads to the same locations. The data returned is compared with the expected value, and the tester stops
checking if an error is detected. BurstLength is taken from RF[257] when the tester is started. It defaults
to 31d (32 writes), but may be set to much larger values to check bit longevity in the RAMs (essentially,
that refresh works correctly).
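The write-then-read-back pattern is easy to model in software. The Python sketch below is an illustration only, not the Verilog tester: the function name run_burst and its parameters are hypothetical, it covers sequential addresses only, it assumes "sequential data" means a value that increments by one per location, and it assumes a burstLength of N produces N + 1 accesses (so that 31 gives 32 writes).

```python
def run_burst(mem_write, mem_read, start_addr, burst_length, first_data):
    """Write a burst, read it back, and report the first mismatch.

    Returns None if every location matches, otherwise a tuple
    (address, expected, actual) for the first failing location.
    """
    count = burst_length + 1  # assumed: a value of 31 means 32 accesses
    for i in range(count):
        mem_write(start_addr + i, first_data + i)
    for i in range(count):
        expected = first_data + i
        actual = mem_read(start_addr + i)
        if actual != expected:
            return (start_addr + i, expected, actual)  # stop at first error
    return None
```

With a dictionary as the memory, mem_write and mem_read can be its __setitem__ and __getitem__ methods; substituting a read function that flips a bit at one address makes the sketch return that address together with the expected and actual values.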
RF[256] (which can be set from the shell) determines whether the address and data will be sequential or
taken from full-period linear feedback shift registers. It is sampled when the “t” command (see below) is
issued. The possibilities are:
0: Sequential address, sequential data. This is the default.
1: Random address, sequential data.
2: Sequential address, random data.
3: Random address, random data.
Sequential addresses can run the RAMs at full speed, since they need to precharge banks infrequently.
Random addresses are slower, but test more of the controller logic.
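The four mode values are simply two independent bits. The Python sketch below decodes them and, for illustration, steps a small full-period linear feedback shift register. The function names are hypothetical, and the 4-bit register with taps on its two low bits is chosen only because it is small enough to check by hand; it is not the generator used by the tester, whose width and taps are not given here.

```python
def decode_test_mode(mode):
    """Split the RF[256] mode value into its two selections."""
    return {
        "random_address": bool(mode & 1),  # modes 1 and 3
        "random_data": bool(mode & 2),     # modes 2 and 3
    }


def lfsr4_step(state):
    """One step of a 4-bit shift register with feedback from bits 0 and 1."""
    bit = (state ^ (state >> 1)) & 1
    return (state >> 1) | (bit << 3)


def lfsr_period(step, seed=1):
    """Count the steps needed for the register to return to its seed."""
    state, steps = step(seed), 1
    while state != seed:
        state = step(state)
        steps += 1
    return steps
```

"Full-period" means the register visits every nonzero state before repeating, which for this 4-bit example is 15 states.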
There are three TC5 routines that control the tester. Because these are frequently executed operations, they
have shell commands to invoke them by typing a single character:
“t” Starts the tester.
“q” Queries the tester for errors. Since the TC5 program has many functions, it is not possible for it to poll
for failure, so the tester must be queried. If there have been no errors, the shell simply restarts and prints
“>”. If an error was detected, the shell prints out six numbers. The first contains the 28-bit address in the lowest
7 hex digits, and the values of the OddWord, HoldFail (indicating data not as expected), Single Error, and
Double Error flags in the 8th hex digit. The second number is the expected data, and the next four numbers
are the data in each of the four 32-bit words making up a 128-bit data burst.
Note that the tester only gathers this information and stops on a data mismatch, or on a single or double
error. When the tester stops, the red LED on the BEE3 front panel is lit.
“e” Forces an error by disabling the generation of “Select” to the fourth rank of RAMs. The error won’t
happen immediately in sequential address mode, since it takes several seconds for the tester to read the
entire RAM.
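The first number printed by the “q” command packs two things into one 32-bit value, and splitting it is a matter of masking. In this Python sketch the function name decode_error_word is hypothetical, and the four flags are returned as a raw nibble because the order of the flags within the 8th hex digit is not given here.

```python
ADDRESS_MASK = 0x0FFFFFFF  # low 7 hex digits: the 28-bit address


def decode_error_word(word):
    """Split the first query number into (address, flag_nibble)."""
    address = word & ADDRESS_MASK
    flags = (word >> 28) & 0xF  # 8th hex digit: the four error flags
    return address, flags
```

For example, a printed value of 51234567 (hex) corresponds to address 1234567 with a flag nibble of 5.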