Cache memory with dual-way arrays and multiplexed parallel output

(12) United States Patent — Yeager
(10) Patent No.: US 6,594,728 B1
(45) Date of Patent: Jul. 15, 2003

(54) CACHE MEMORY WITH DUAL-WAY ARRAYS AND MULTIPLEXED PARALLEL OUTPUT

(75) Inventor: Kenneth C. Yeager, Sunnyvale, CA (US)

(73) Assignee: MIPS Technologies, Inc., Mountain View, CA (US)

( * ) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 0 days.

(21) Appl. No.: 08/813,500

Related U.S. Application Data: Continuation of application No. 08/324,124, filed on Oct. 14, 1994, now abandoned.

(51) Int. Cl.7: G06F 12/08
(52) U.S. Cl.: 711/127; 711/120; 711/128
(58) Field of Search: 395/164, 430, 454, 455, 484, 250, 447, 415; 365/189.01, 189.02, 230.02, 230.04; 364/DIG. 1; 711/127, 128, 120

Primary Examiner—B. James Peikari
(57) ABSTRACT

A two-way cache memory having multiplexed outputs and alternating ways is disclosed. Multiplexed outputs enable the cache memory to be more densely packed and implemented with fewer sense amplifiers. Alternating ways enable two distinct cache access patterns. According to a first access pattern, two doublewords in the same way may be accessed simultaneously. Such access facilitates the loading of data into main memory. According to a second access pattern, two doublewords in the same location but in different ways may be accessed simultaneously. Such access facilitates the loading of a particular word into a register file.

17 Claims, 13 Drawing Sheets

Microfiche Appendix Included (1 Microfiche, 48 Pages)
[Cover drawing: load/store unit datapath, showing the integer register file, address queue 308, address stack 420, address calculate unit (ACU) 418, JTLB 422, and store aligner 430.]
[Sheet 5 of 13 — FIG. 5: block organization within each bank of the cache data array. Array 0 (459a) and Array 1 (459b) each hold doublewords 0-3 (455-458), with way 0 and way 1 assignments alternating between the two arrays; lines 460a-b and 461a-b mark the two access patterns.]
CACHE MEMORY WITH DUAL-WAY ARRAYS AND MULTIPLEXED PARALLEL OUTPUT

This is a continuation of application No. 08/324,124 filed Oct. 14, 1994, now abandoned.

A preferred embodiment of the present invention is incorporated in a superscalar processor identified as "R10000," which was developed by Silicon Graphics, Inc., of Mountain View, California. Various aspects of the R10000 are described in U.S. Ser. Nos. 08/324,128, 08/324,129 and 08/324,127, all incorporated herein by reference for all purposes. The R10000 is also described in J. Heinrich, MIPS R10000 Microprocessor User's Manual, MIPS Technologies, Inc. (1994).

MICROFICHE APPENDIX

A microfiche appendix containing one sheet and forty-eight frames is included as Appendices I and II to this application and is hereby incorporated by reference in its entirety for all purposes. The microfiche appendix is directed to Chapters 16 and 17 of the design notes describing the R10000 processor.
BACKGROUND OF THE INVENTION

This invention relates in general to computers and in particular, to cache memory.

CPU designers, since the inception of computers, have been driven to design faster and better processors in a cost-effective manner. For example, as faster versions of a particular CPU become available, designers will often increase the CPU's clock frequency as a simple and cost-effective means of improving the CPU's throughput.

After a certain point, the speed of the system's main memory (input/output) becomes a limiting factor as to how fast the CPU can operate. When the CPU's operating speed exceeds the main memory's operating requirements, the CPU must issue one or more wait states to allow memory to catch up. Wait states, however, have a deleterious effect on the CPU's performance. In some instances, one wait state can decrease the CPU's performance by about 20-30%.

Although wait states can be eliminated by employing faster memory, such memory is very expensive and may be impractical. Typically, the difference between the price of a fast memory chip and the next fastest speed grade can range from 50-100%. Thus, the cost can be quite prohibitive, especially for a system requiring a large memory.

A cost-effective solution has been to provide the CPU with a hierarchical memory consisting of multiple levels of memory with different speeds and sizes. Since the fastest memories are more expensive per bit than slower memories, they are usually smaller in size. This smaller memory, referred to as a "cache", is located close to the microprocessor or even integrated into the same chip as the microprocessor.

Conceptually, the memory controller retrieves instructions and data that are currently used by the processor and stores them into the cache. When a processor fetches instructions or data, it first checks the cache. The control logic determines if the required information is stored in the cache (a cache hit). If a cache hit occurs, the CPU does not need to access main memory. The control logic uses valuable cycles to determine if the requested data is in the cache. However, this cost is acceptable since access to main memory is much slower.

As can be seen, the higher the cache "hit" rate is, the faster the CPU can perform its duties. Obviously, the larger the cache, the more data it can store, and thus, the higher the probability of a hit. However, in the real world, microprocessor designers are always faced with size constraints because there is limited available space on a die. Using a larger die size, although effective, is not practical since the cost increases as die size increases. Further, reducing the size of the cache without reducing the performance allows the designer to improve the performance of other functional units of the CPU.

Thus, there is a need for designing a cache that can determine if a hit has occurred using a minimum number of cycles and a high hit rate while reducing the space needed on the chip.

SUMMARY OF THE INVENTION

The present invention offers a highly efficient mechanism for implementing cache memory in a computer system. This mechanism enables the cache memory to have a high "hit" rate, fast access time, low latency, and reduced physical size.

In one embodiment, the present invention provides a cache which operates in parallel with the translation lookaside buffer to reduce its latency. The cache contains two 2-way set-associative arrays that are interleaved together. Each 2-way set-associative array includes two arrays, one each for the tag and data. By having four independently operating cache arrays, up to four instructions can operate simultaneously. The bits in each data array are interleaved to allow two distinct access patterns. For example, when the cache is loaded or copied back, two doublewords in the same block are accessed simultaneously. When the cache is read, the same doubleword location is simultaneously read from both blocks within the set. Further, by using a multiplexer, the number of sense amplifiers for reading and writing is reduced, thereby saving significant valuable space on the die.

A better understanding of the nature and advantages of the present invention may be had with reference to the detailed description and the drawings below.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 discloses a functional block diagram of a superscalar processor;
FIG. 2 discloses a functional block diagram of a load/store unit;
FIG. 3 discloses a block diagram of a cache bank;
FIG. 4 discloses a block diagram of a cache data array and control logic;
FIG. 5 discloses the block organization within each bank of the cache data array;
FIG. 6 illustrates the bit arrangement within each bank of the data cache;
FIG. 7 discloses a logic diagram of the cache control logic;
FIG. 8 discloses the connection between the two banks of the cache;
FIG. 9 discloses a block diagram of a cache tag array and control logic;
FIG. 10 discloses bit fields of the tag;
FIGS. 11A-11B disclose the tag check logic;
FIG. 12 discloses a logic diagram for generating a cache hit pulse; and
FIG. 13 discloses a block diagram of the row decoder for the cache tag array.
DESCRIPTION OF THE PREFERRED EMBODIMENT

Contents

I. Superscalar Processor Architecture
   A. Superscalar Processor Overview
   B. Operation
II. Load/Store Unit
   A. Load/Store Unit Overview
   B. Operation
III. Data Cache
   A. Data Cache Overview
   B. Data Array
      1. Data Array Organization
      2. Data Array Control Logic
   C. Tag Array
      1. Tag Array Organization
      2. Tag Array Control Logic
   D. Cache Interface
I. Superscalar Processor Architecture

FIG. 1 discloses a functional block diagram of a superscalar processor 100 which incorporates a cache memory in accordance with the present invention. Processor 100, which generally represents the R10000 Superscalar Processor developed by Silicon Graphics, Inc., of Mountain View, Calif., provides only one example of an application for the cache memory of the present invention.

A. Superscalar Processor Overview

A superscalar processor can fetch and execute more than one instruction in parallel. Processor 100 fetches and decodes four instructions per cycle. Each decoded instruction is appended to one of three instruction queues. These queues can issue one new instruction per cycle to each of five execution pipelines.

The block diagram of FIG. 1 is arranged to show the stages of an instruction pipeline and illustrates functional interconnectivity between various processor elements. Generally, instruction fetch and decode are carried out in stages 1 and 2; instructions are issued from various queues in stage 3; and instruction execution is performed in stages 4-7.

Referring to FIG. 1, a primary instruction cache 102 reads four consecutive instructions per cycle, beginning on any word boundary within a cache block. A branch target cache 104, instruction register 106, and instruction decode and dependency logic 200 convey portions of issued instructions to floating point mapping table 204 (32 word by 6 bit RAM) or integer mapping table 206 (33 word by 6 bit RAM). These tables carry out a "register renaming" operation, described in detail below, which renames logical registers identified in an instruction with a physical register location for holding values during instruction execution. A redundant mapping mechanism is built into these tables to facilitate efficient recovery from branch mispredictions. Mapping tables 204 and 206 also receive input from a floating point free list 208 (32 word by 6 bit RAM) and an integer free list 210 (32 word by 6 bit RAM), respectively. Output of both mapping tables is fed to active list 212 which, in turn, feeds the inputs of free lists 208 and 210.

A branch unit 214 also receives information from instruction register 106, as shown in FIG. 1. This unit processes no more than one branch per cycle. The branch unit includes a branch stack 216 which contains one entry for each conditional branch. Processor 100 can execute a conditional branch speculatively by predicting the most likely path and decoding instructions along that path. The prediction is verified when the condition becomes known. If the correct path is taken, processing continues along that path. Otherwise, the decision must be reversed, all speculatively decoded instructions must be aborted, and the program counter and mapping hardware must be restored.

Referring again to FIG. 1, mapping tables 204 and 206 support three general pipelines, which incorporate five execution units. First, a floating-point pipeline is coupled to floating-point mapping table 204. The floating-point pipeline includes a sixteen-entry instruction queue 300 which communicates with a sixty-four-location floating point register file 302. Register file 302 and instruction queue 300 feed parallel multiply unit 400 and adder 404 (which performs, among other things, comparison operations to confirm floating-point branch predictions). Multiply unit 400 also provides input to a divide unit 408 and square root unit 410.

Second, an integer pipeline is coupled to integer mapping table 206. The integer pipeline includes a sixteen-entry integer instruction queue 304 which communicates with a sixty-four-location integer register file 306. Register file 306 and instruction queue 304 feed arithmetic logic units ("ALU") ALU#1 412 (which contains an ALU, shifter and integer branch comparator) and ALU#2 414 (which contains an ALU, integer multiplier and divider).

Third, a load/store pipeline (or load/store unit) 416 is coupled to integer mapping table 206. This pipeline includes a sixteen-entry address queue 308 which communicates with register file 306. The architecture of address queue 308 is described in detail in commonly-owned, co-pending patent application, Ser. No. 08/324,128, which is hereby incorporated by reference in its entirety for all purposes. Register file 306 and address queue 308 feed integer address calculate unit 418 which, in turn, provides virtual address entries for address stack 420. These virtual addresses are converted to physical addresses in joint translation lookaside buffer (JTLB) 422 and used to access a data cache 424.

Data input to and output from data cache 424 pass through store aligner 430 and load aligner 428, respectively. Address stack 420 and data cache 424 also communicate with external hardware controller and interface 434. Further, data cache 424 and controller/interface 434 communicate with secondary cache 432.

B. Operation

Processor 100 uses multiple execution pipelines to overlap instruction execution in five functional units. As described above, these units include the two integer ALUs 412, 414, load/store unit 416, floating-point adder 404 and floating-point multiplier 400. Each associated pipeline includes stages for issuing instructions, reading register operands, executing instructions, and storing results. There are also three "iterative" units (i.e., ALU#2 414, floating-point divide unit 408, and floating-point square root unit 410) which compute more complex results.

Register files 302 and 306 must have multiple read and write ports to keep the functional units of processor 100 busy. Integer register file 306 has seven read and three write ports; floating-point register file 302 has five read and three write ports. The integer and floating-point pipelines each use two dedicated operand ports and one dedicated result port in the appropriate register file. Load/Store unit 416 uses two dedicated integer operand ports for address calculation. The Load/Store unit also loads or stores either integer or floating-point values via a shared write port and a shared read port in both register files. These shared ports are also used to move data between the integer and floating-point register files.
In a pipeline, the execution of each instruction is divided into a sequence of simpler operations. Each operation is performed by a separate hardware section called a stage. Each stage passes its result to the next stage. Usually, each instruction requires only a single cycle in each stage, and each stage can begin a new instruction while previous instructions are being completed by later stages. Thus, a new instruction can often begin during every cycle.

Pipelines greatly improve the rate at which instructions can be executed. However, the efficient use of a pipeline requires that several instructions be executed in parallel. The result of each instruction is not available for several cycles after that instruction enters the pipeline. Thus, new instructions must not depend on the results of instructions which are still in the pipeline.

Processor 100 fetches and decodes instructions in their original program order but may execute and complete these instructions out of order. Once completed, instructions are "graduated" in their original program order. Instruction fetching is carried out by reading instructions from instruction cache 102, shown in FIG. 1. Instruction decode operation includes dependency checks and register renaming, performed by instruction decode and dependency logic 200 and mapping tables 204 or 206, respectively. The execution units identified above compute an arithmetic result from the operands of an instruction. Execution is complete when a result has been computed and stored in a temporary register identified by register file 302 or 306. Finally, graduation commits this temporary result as a new permanent value.

An instruction can graduate only after it and all previous instructions have been successfully completed. Until an instruction has graduated, it can be aborted, and all previous register and memory values can be restored to a precise state following any exception. This state is restored by "unnaming" the temporary physical registers assigned to subsequent instructions. Registers are unnamed by writing an old destination register into the associated mapping table and returning a new destination register to the free list. Renaming is done in reverse program order, in the event a logical register was used more than once. After renaming, register files 302 and 306 contain only the permanent values which were created by instructions prior to the exception. Once an instruction has graduated, however, all previous values are lost.
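The rename and unname sequence described above can be modeled in software. The following C sketch is illustrative only; the names, the register counts, and the stack-style free list are assumptions for the example, not details taken from the specification. On decode, the old mapping is saved alongside the new one; on an exception, entries are walked in reverse program order, restoring the old destination and freeing the new one.

    #include <stdint.h>

    /* Illustrative model; names and sizes are hypothetical. */
    enum { LOG_REGS = 33, PHYS_REGS = 64 };

    typedef struct {
        uint8_t map[LOG_REGS];        /* logical -> physical mapping table */
        uint8_t free_list[PHYS_REGS]; /* stack of free physical registers  */
        int     free_top;
    } RenameState;

    typedef struct {                  /* one active-list entry per instruction */
        uint8_t logical;              /* logical destination register          */
        uint8_t new_phys;             /* physical register assigned at decode  */
        uint8_t old_phys;             /* previous mapping, kept for recovery   */
    } ActiveEntry;

    /* Decode: rename the destination, remembering the old mapping. */
    static void rename_dest(RenameState *rs, ActiveEntry *e, uint8_t logical) {
        e->logical  = logical;
        e->old_phys = rs->map[logical];
        e->new_phys = rs->free_list[--rs->free_top]; /* take a free register */
        rs->map[logical] = e->new_phys;
    }

    /* Exception recovery: "unname" in reverse program order, restoring
       the old destination and returning the new one to the free list.  */
    static void unname(RenameState *rs, ActiveEntry *al, int n_active) {
        for (int i = n_active - 1; i >= 0; i--) {
            rs->map[al[i].logical] = al[i].old_phys;
            rs->free_list[rs->free_top++] = al[i].new_phys;
        }
    }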
Active list 212 is a list of "active" instructions in program order. It records status, such as which instructions have been completed or have detected exceptions. Instructions are appended to its bottom when they are decoded. Completed instructions are removed from its top when they graduate.

II. Load/Store Unit

A. Load/Store Unit Overview

Microprocessor 100 uses register files 306 and 302 to store integer and floating point register values, respectively. As such, the width of each file is equal to the width of the microprocessor's data path. Since physical registers are also used to store tentative results for instructions which are completed but not yet graduated, register files 302 and 306 should contain more physical registers than there are logical registers. In one embodiment, register files contain 64 physical registers, twice the number of logical registers. Multiple read and write ports are provided for each register file to allow data to be read and written from the microprocessor's various functional units in parallel.

Primary instruction cache 102, data cache 424, and branch stack 216 are interconnected by a data path. To minimize the necessary wiring, the functional units share the data path. Sharing data paths creates bus contention. This problem is alleviated by employing two phase-multiplexed unidirectional data paths to interconnect the functional units.

Load/Store Unit 416 facilitates data transfer instructions between the microprocessor's register files, data cache 424, and main memory, such as "load", "store", "prefetch", and "cache" instructions.

Normally, main memory is accessed via data cache 424. The data cache greatly improves memory performance by retaining recently used data in local, high speed buffer memories. Microprocessor 100 also includes secondary cache 432 to augment the data cache. Depending on availability of space, secondary cache 432 may be implemented as a separate chip.

All "cached" operations first access data cache 424. If the data is present therein (a "cache hit"), a load can be completed in two cycles. Otherwise, access to the secondary cache is initiated. If it "hits", the primary cache is "refilled", and a load takes at least 8 cycles. Otherwise, main memory must be read and both caches refilled. In such cases, a load would take significantly longer.

Microprocessor 100 executes cached operations "out-of-order". An instruction's address can be calculated as soon as its index registers are valid, even if previous instructions are waiting for their index registers to become valid. Cache misses do not block later instructions ("non-blocking"); the unit can begin new operations while as many as eight cache misses are processed.

"Uncached" operations bypass the caches and always access the system bus. Typically, uncached operations access input/output devices or special-purpose memories. Uncached operations are executed only when the instruction is about to graduate. Uncached operations must be performed sequentially in original program order because they cannot be undone in the event of an exception. Both uncached writes and reads may alter the state of the I/O subsystem. Uncached operations are kept in the address stack, but no dependency checks are performed for them. The operand of an uncached store is copied into the Store Buffer, but no load bypass can occur. When the store graduates, the buffered data is transferred to the external interface.

Although uncached operations are delayed until they graduate, cached operations may proceed out of order. That is, subsequent cached loads may be executed, and cached stores may initiate tag check and cache refill operations.

Prefetch instructions are used to fetch memory blocks into the primary and secondary caches. They are used to increase performance by reducing delays required to refill caches, but they have no effect on the logical execution of the program. Prefetch instructions can significantly improve programs which have predictable memory accesses but a high cache miss ratio. However, improper use of prefetch instructions can reduce performance by interfering with normal memory accesses.

There are two formats of prefetch instructions, so that either "base+offset" (PREF, opcode 63 octal) or "base+index" (PFETCH, opcode 23 function 17 octal) addressing may be used. These instructions are defined in C. Price, MIPS R10000-MIPS IV ISA Manual, MIPS Technologies, Inc. (1994). Each format includes a 5-bit "hint" field which indicates what prefetching operation is expected. However, the architecture allows any hardware implementation to ignore the hint field or the entire instruction because it does not affect the program's result. If any problem is encountered, prefetch instructions are aborted without generating any exceptions.

Prefetched data is loaded into both the secondary and primary data caches. The "hint" field applies to both caches. If the external interface is busy when the address queue executes a prefetch instruction, the queue will retry that instruction later.
However, if the addressed cache set cannot be refilled due to a dependency lock or previous refill operations, the instruction causes no action.

Microprocessor 100 uses only the low three bits of the hint field. If bit 0 is set, the cache will request an exclusive copy of the cache block, which can be written. Otherwise, the cache will request a shared copy of the cache block. If bit 2 is set, bit 1 selects which way is refilled if there is a cache miss. If the selected way is locked or already in refill in the primary data cache, the prefetch instruction causes no action.
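As a minimal illustration of this hint decoding, the following C sketch extracts the three low bits just described; the struct and field names are hypothetical, introduced only for the example.

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct {
        bool exclusive;   /* request a writable (exclusive) copy            */
        bool way_valid;   /* bit 2 set: bit 1 explicitly selects the way    */
        int  way;         /* way to refill on a miss, when way_valid is set */
    } PrefetchHint;

    static PrefetchHint decode_hint(uint32_t hint) {
        PrefetchHint h;
        h.exclusive = (hint >> 0) & 1;  /* bit 0: exclusive vs. shared copy */
        h.way_valid = (hint >> 2) & 1;  /* bit 2: way-select enable         */
        h.way       = (hint >> 1) & 1;  /* bit 1: which way to refill       */
        return h;
    }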
Prefetch instructions are decoded, queued, issued, and tag-checked like other memory operations. But the prefetch is marked "done" after only a single tag check cycle. If there is a cache hit (with the required write permission), the instruction is complete. If there is a miss and a cache block is available, a cache refill operation is requested from the external interface. The refill status is recorded in the cache tag. However, because the entry in the address stack is "done", it will not wait for the refill to be completed.

The Load/Store unit can speculatively execute instructions. These must be aborted in case a branch is reversed. If an aborted instruction initiated a cache refill, the refill operation must be completed. These refills are called "orphans" because they no longer correspond to any instruction in the processor. Prefetch instructions also create orphans because they are "done" as soon as the initial tag cycle is completed. The address tag of such a cache block remains in a "refilling" state until the cache refill has been completed. If a subsequent instruction addresses this cache block, it can use this block, initially in a "wait on refill" state.
B. Operation

Referring to FIG. 2, Load/Store Unit 416 comprises an address queue 308, address stack 420, address calculate unit (ACU) 418, store aligner 430, load aligner 428, translation lookaside buffer (JTLB) 422, and data cache 424. Data cache 424, being a set-associative data cache, comprises cache tag array 650, cache data array 610, and the tag check logic 660. The TLB and data cache are configured to operate in parallel so as to reduce the latency for load instructions, translating to about a 15% improvement in operating speed.

Address queue 308 communicates with register file 306 and address calculate unit 418. The address queue, containing 16 entries organized as a circular first-in first-out (FIFO) buffer, keeps track of all memory instructions such as loads, stores, and "Cache" instructions that manipulate any of the caches. Any time a memory instruction is decoded, it is allocated to the next sequential entry at the bottom of the queue. When an instruction graduates, it is deleted from the top of the list. Graduation occurs if the instruction completes without an error and all previous instructions have graduated. Instructions are graduated in the original program order even though they may not have been executed in order.
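The allocate-at-bottom, delete-from-top discipline is that of an ordinary circular FIFO. A minimal C sketch of the behavior, with entry contents elided and names assumed for illustration:

    #include <stdint.h>

    enum { AQ_ENTRIES = 16 };

    typedef struct {
        uint32_t entry[AQ_ENTRIES];
        unsigned top, bottom, count;  /* indices wrap modulo 16 */
    } AddressQueue;

    /* Decode: allocate the next sequential entry at the bottom. */
    static int aq_allocate(AddressQueue *q, uint32_t instr) {
        if (q->count == AQ_ENTRIES) return -1;     /* queue full: stall */
        q->entry[q->bottom] = instr;
        q->bottom = (q->bottom + 1) % AQ_ENTRIES;
        q->count++;
        return 0;
    }

    /* Graduation: delete the oldest entry from the top. */
    static void aq_graduate(AddressQueue *q) {
        if (q->count) {
            q->top = (q->top + 1) % AQ_ENTRIES;
            q->count--;
        }
    }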
Each entry in the address queue comprises several instruction fields, which are exemplified in Table I. A more detailed description of the contents and structure of the address queue can be found in commonly owned and co-pending patent application Serial No. 08/324,129.

TABLE I

Address Queue Instruction Fields

Bits  Field        Description

1     AQvActiveF   Entry is active. (Decoded from queue pointers, delayed one cycle.)
5     AQvTag       Active List tag uniquely identifies this instruction within the pipeline.
7     AQvFunc      Instruction opcode and function:
                   0nnnnnn: 6-bit major opcode (modified during instruction predecode), or
                   10nnnnn: 6-bit function code from COP1X opcode (AQ gets codes #00-#37 octal only.)
                   11fffcc: 5-bit subfunction code for CACHE operations (3-bit function, 2-bit cache select.)
16    AQvImm       16-bit immediate field contains instruction bits [15:0].

                   Base Register:
6     AQvOpSelA    Operand A, select physical register# in Integer Register File.
1     AQvOpRdyA    Operand A is ready for address calculation.
1     AQvOpValA    Operand A is valid for address calculation. (Integer register# is not zero.)

                   Index Register, or Integer Operand:
6     AQvOpSelB    Operand B, select physical register# in Integer Register File. (For integer stores, this 6-bit value is duplicated in AQvOpSelC.)
1     AQvOpRdyB    Operand B is ready.
1     AQvOpValB    Operand B is valid. (Integer register# is not zero.)

                   Floating Point Operands:
6     AQvOpSelC    Operand C, select physical register# in Flt. Pt. Register File. (For integer stores, this field contains a copy of AQvOpSelB.)
1     AQvOpRdyC    Operand C is ready.
1     AQvOpValC    Operand C is valid.

6     AQvDest      Destination, select physical register#.
2     AQvDType     Destination type (or hint):
                   00 = No destination register. (If prefetch instruction, hint = "shared".)
                   01 = No destination register. (If prefetch instruction, hint = "exclusive".)
                   10 = Integer destination register.
                   11 = Floating-point destination register.
4     AQvUseR      Which ports of the shared register files are required to execute this instruction?
                   Bit 3: Flt.pt. Write. Bit 2: Flt.pt. Read. Bit 1: Integer Write. Bit 0: Integer Read.
1     AQvStore     This instruction is a store.
1     AQvFlt       This instruction loads or stores a floating-point register.
1     AQvFltHi     Load or store high half of floating-point register (if FR=0).

When the operands for a memory instruction are available, the address queue issues it for execution by sending the necessary operands to the ACU 418. For "indexed" operations, the ACU receives base register and index register operands from Register File 306. As for other load or store instructions, the address queue provides a base register operand via Register File 306 and an immediate value directly.

ACU 418 calculates a virtual address corresponding to the operands it received during the previous cycle. As discussed, the data cache is virtually indexed and physically tagged. As such, the virtual address is divided into two portions, the "tag" and the "index". The index of the virtual address is passed to the cache to determine which cache location to access while the TLB 422 translates the virtual address into a physical address or real page address. The architecture of the TLB and virtual memory is discussed in detail in commonly owned and co-pending application Ser. No. 08/324,128. The real page address, referred to as the tag or physical tag, is stored in cache Tag array 650.
While the TLB is translating the virtual address into a physical address, the virtual index is used to access data and tag in the cache data array and cache tag array, respectively. In this manner, the physical address, tag, and data are available simultaneously. Tag check 660 compares the tag and physical address and, if there is a match, generates a hit signal to indicate that the requested data is present in the data cache. The requested data are aligned by load aligner 428 according to the lower address bits and then written to their destination in register files 302 or 306.
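This parallel TLB and cache access amounts to a virtually indexed, physically tagged lookup. The following C sketch models the behavior (not the timing); the translate stub and the tag table are hypothetical stand-ins for JTLB 422 and the two tag sub-arrays.

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct { uint32_t tag; bool valid; } TagEntry;

    static TagEntry tag_array[2][256];              /* two ways x 256 sets     */

    static uint32_t tlb_translate(uint32_t vaddr) { /* stand-in for JTLB 422   */
        return vaddr >> 12;                         /* identity map, 4-KB pages */
    }

    /* Returns the hitting way (0 or 1), or -1 on a miss. */
    static int tag_check(uint32_t vaddr) {
        unsigned index = (vaddr >> 6) & 0xFF;  /* virtual index, bits 13:6 */
        uint32_t ptag  = tlb_translate(vaddr); /* real page address        */

        TagEntry t0 = tag_array[0][index];     /* both ways read together, */
        TagEntry t1 = tag_array[1][index];     /* in parallel with the TLB */
        if (t0.valid && t0.tag == ptag) return 0;
        if (t1.valid && t1.tag == ptag) return 1;
        return -1;                             /* miss: refill required    */
    }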
A store path is provided for writing register values from register files 302 and 306 into data cache 424. Store aligner 430 aligns the data to be written into the cache. Bypass path 601 enables data for uncached operations to bypass the data cache.

Additionally, a bypass path 390 is provided to improve performance of the load/store unit. Bypass path 390 allows data to circumvent the register files or memory when the microprocessor is reading a location during the same cycle it is written into the register file. This bypass is selected whenever the operand register number equals the previous instruction's destination register number. For example, the result of an execution unit can be multiplexed directly into its operand registers so that a dependent instruction can be executed while that result is written.

The physical tag and virtual index are also written into address stack 420, which is logically part of the address queue but physically separate due to layout considerations. The microprocessor uses address stack 420 to store the physical memory address corresponding to each instruction in the address queue. Consequently, the address stack is implemented with the same number of entries as the address queue. Data are loaded into the address stack 420 during the address calculation sequence. The address stack is described in detail in commonly owned and co-pending patent application Serial No. 08/324,129.
III. Data Cache

A. Data Cache Overview

The specification and operations for the data cache are included as Appendices I and II.

Data cache 424 is used by load and store instructions that access "cacheable" regions of main memory. The data cache 424 is interleaved with two identical banks, bank "0" and bank "1".

Referring to FIG. 3, each bank comprises a cache tag array 650 and cache data array 610. The tag array stores the tag associated with a block of data in the data array. The data array, on the other hand, retains recently used memory data. In one embodiment, microprocessor 100 employs a 32 K-byte data cache. As such, each data array comprises 16 K-bytes divided into 256 rows or word lines, each containing two blocks of 4 doublewords (8 words). Each doubleword has 64 bits plus 8 parity bits for a total of 72 bits. The data array can access two doublewords in parallel. As for tag array 650, it has 64 rows of 35 bits each and can access two 32-bit tags in parallel.

Bank 0 and bank 1 operate independently and are accessed depending on the value of the virtual index. The tag and data arrays are allocated among requests from the address queue, address stack, ACU, and external interface. Some instructions allow the tag and data arrays to operate independently. For example, store instructions require only the tag array, thus leaving the data array free. Thus, the four functional units can conceivably operate simultaneously if they are allocated the cache array(s) they need.

Each bank is 2-way set-associative. In effect, cache data array 610 and cache tag array 650 are each sub-divided into two sub-arrays or ways to provide an additional location into which main memory addresses with shared index bits can be mapped. This decreases thrashing in the cache and improves the hit rate without having to increase the size of the cache. Sub-arrays for tag array 650 are referred to as way "0" and way "1"; those for the data array are referred to as sub-array 0 and sub-array 1. Thus, the tag array can access two tags (tag 0 and tag 1) in parallel and each data array can access two doublewords (ar0data and ar1data) in parallel. For CPU and external "interrogate" operations, tag 0 and tag 1 are checked in parallel (read and compared) to determine which way of the cache, if any, contains the desired data. The way is remembered and used later for graduating stores or for external refill or writeback operations.

The arrays are "virtually indexed" using the index portion (bits 13:0) of the virtual address. Bit 5 selects bank #0 or bank #1. Bits 2:0 select a byte within a doubleword and bits 4:3 select a doubleword within a block. Bits 13:6, which address a block within an array, are decoded to select one of 256 "word lines". Each word line contains 8 doublewords, or two blocks. The bits are interlaced so that doublewords within these blocks are accessed differently for processor or external interface operations. The processor associatively accesses doublewords within two blocks. On the other hand, the external interface accesses two doublewords within the same block.
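These bit assignments can be summarized in a short C sketch: the field extractors follow the text directly, and the constants check that the stated geometry totals 32 K-bytes. This is a sanity sketch with invented names, not part of the specification.

    #include <assert.h>
    #include <stdint.h>

    /* Geometry stated above: 2 banks x 256 rows x 2 blocks x 4
       doublewords x 8 bytes (parity bits excluded).               */
    enum {
        BYTES_PER_DW = 8, DW_PER_BLOCK = 4, BLOCKS_PER_SET = 2,
        ROWS_PER_BANK = 256, BANKS = 2
    };
    static_assert(BANKS * ROWS_PER_BANK * BLOCKS_PER_SET * DW_PER_BLOCK
                  * BYTES_PER_DW == 32 * 1024, "32-KB data cache");

    /* Field extraction from the 14-bit virtual index. */
    static unsigned va_byte(uint32_t va)     { return va & 0x7; }         /* 2:0  */
    static unsigned va_dword(uint32_t va)    { return (va >> 3) & 0x3; }  /* 4:3  */
    static unsigned va_bank(uint32_t va)     { return (va >> 5) & 0x1; }  /* 5    */
    static unsigned va_wordline(uint32_t va) { return (va >> 6) & 0xFF; } /* 13:6 */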
Separate address multiplexers are provided for the data and tag arrays. Multiplexer 621 selects address inputs from among the external interface, address queue, address stack, and ACU for use by data array 610. As for the tag array, multiplexer 620 selects address inputs from the external interface, address stack, or ACU. Select signals (select tag and select data) for multiplexers 620 and 621 are generated by the address queue and decoded by cache control logic in order to determine which functional unit is controlling each array, as shown in Table II.

TABLE II

Data Cache Index Address

Enable  Select  Input Address (Virtual)  Data Width  Description

0       00      (none)                   —           Power down. Do not enable any word line or amplifier.
1       00      ACAdr[13:3]              64          Address calculation.
1       01      ASIndex[13:3]            64          Retry tag check or load using address from Address Stack.
1       10      ASStore[13:3]            64          Address for writing data into the data array during graduation of a store instruction. (Not used for tag arrays.)
1       11      ExtIndex[13:4]           128         Address from external interface, for refill or interrogate.

Separate data multiplexers are also provided for the tag and data arrays. Multiplexer 625 selects an address from either the external interface or the JTLB to write to the tag array. Multiplexers 630 and 631 select among data from the external interface or register files for writing into the data array.

When a cache section is not used, as indicated by an address selection value of "00", it is powered down. In power-down mode, the address decoder is disabled so that it does not select any word line; the dummy word line is disabled so that none of the sense amplifiers turn on. The decoder is disabled using an extra input in the first level of gates.
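A C sketch of the Table II decode, treating the enable bit and the two select bits as inputs; the enum names are invented for illustration.

    typedef enum { SRC_NONE, SRC_ACU, SRC_STACK, SRC_STORE, SRC_EXT } IndexSource;

    static IndexSource index_source(int enable, unsigned select2) {
        if (!enable) return SRC_NONE;   /* power down: no word line,
                                           no sense amplifiers enabled */
        switch (select2 & 3) {
        case 0:  return SRC_ACU;    /* ACAdr[13:3]   - address calculation   */
        case 1:  return SRC_STACK;  /* ASIndex[13:3] - retry from the stack  */
        case 2:  return SRC_STORE;  /* ASStore[13:3] - graduating store      */
        default: return SRC_EXT;    /* ExtIndex[13:4] - refill/interrogate   */
        }
    }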
In some embodiments, the data cache is not initialized by hardware during reset. These embodiments require a bootstrap program to initialize all tags to the "Invalid" state before making any use of the data cache. If a block is "Invalid", the contents of its data array are not used.
B. Data Array

1. Data Array Organization

Referring to FIG. 4, data array 610 contains 256 rows. Decoder 450 selects which row is accessed by driving its "word line" high. Each word line drives a row of 576 cells (512 data bits and 64 parity bits) and is gated and buffered every 32 cells. The 576 bits equal one cache set, with each way containing 4 doublewords (a word equals 4 bytes).

Each bit, as represented by 460, in a doubleword contains eight cells 461a-461h. The cells correspond to one bit from each of the four doublewords in each way of the cache. The number on the top of each cell represents the cache way; the number on the bottom represents the doubleword number within the way.

Multiplexer 470, controlled by signal S0, selects 1 of 4 bits (1 from each of the 4 doublewords in the block) to input into sense amplifier 475. The output of the sense amplifier represents data from sub-array 0. Similarly, S1 controls which bit multiplexer 471 and sense amplifier 476 read from sub-array 1.

Microprocessor 100 uses different select signals S0 and S1 for processor and external interface accesses due to their different access patterns. The select signal for external interface operations uses virtual address bits 4:3 and a way bit; CPU accesses use virtual address bits 4:3 only. The access patterns for CPU and external interface are shown in Table III.
TABLE III

Sense Amplifier Multiplexer in Data Cache

CPU Accesses (64 bits, Associative)

Adr[4:3]  S0 (Array 0)  S1 (Array 1)

00        1000          1000
01        0100          0100
10        0010          0010
11        0001          0001

External Interface (128 bits in one Way)

Adr[4]  Way  S0 (Array 0)  S1 (Array 1)

0       0    1000          0100
0       1    0100          1000
1       0    0010          0001
1       1    0001          0010
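Table III can be generated by a simple rule: a CPU access drives the same one-hot select into both arrays, while an external access offsets the two selects by the way bit. A C sketch under that reading (here bit i of each select corresponds to doubleword slot i, so the table's leftmost "1" is bit 0):

    #include <stdint.h>

    typedef struct { uint8_t s0, s1; } MuxSel;

    /* CPU (associative) read: the same doubleword slot from both arrays. */
    static MuxSel cpu_select(unsigned dw) {          /* dw = Adr[4:3] */
        MuxSel m = { (uint8_t)(1u << dw), (uint8_t)(1u << dw) };
        return m;
    }

    /* External interface: two adjacent doublewords of one way.  Array 0
       holds the doubleword of the pair whose low bit equals the way;
       Array 1 holds its neighbor (compare with Table III).             */
    static MuxSel ext_select(unsigned adr4, unsigned way) {
        MuxSel m;
        m.s0 = (uint8_t)(1u << (2 * adr4 + (way & 1)));
        m.s1 = (uint8_t)(1u << (2 * adr4 + ((way ^ 1) & 1)));
        return m;
    }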
Data from sub-array 0 and sub-array 1 are either loaded into the register files or written to main memory via the external interface. For register loads, data are multiplexed with the corresponding sub-array 0 and sub-array 1 data from the other cache bank by multiplexers 480 and 485. The select bank signal (bit 5) passes data from the appropriate bank to load aligners 428a and 428b, respectively, and into multiplexer 486. Cache hit signals, generated by the tag check logic, dictate which data, if any, are read out of the cache.

Whenever the cache is read and its tag indicates a valid entry, all bytes are checked for proper parity by parity checker 477. If any byte is in error during a load instruction, the processor takes a "Data Cache Error" exception.

As for writes to memory, multiplexers 490 and 491 select the desired data as indicated by virtual address bit 5. Thereafter, the data is passed into phase multiplexer circuit 495, which writes the first doubleword during φ1 of the clock cycle and the second doubleword during φ2 of the same clock cycle. Parity checker 478 checks all bytes of the doubleword for proper parity, and if any error occurs, the processor takes an "External Interface Data Cache Error" exception.
signals and Write into cache via drivers 469a—469h located
SO (Array 0)
S1 (Array 1)
1000
0100
0010
0001
0100
1000
0001
0010
45
even doubleWords (Adr[3]=O) are in Array 0 and odd
doubleWords are in Array 1. This is reversed for block 1.
This alloWs access of a tWo doubleWord from the same Way
Which otherWise Would be impossible in conventional 2-Way
set-associative caches, unless the data line from each Way is
2-doubleWords in Width. This requires tWice the number of
sense ampli?ers.
Also, by using multiplexers, the external interface can
Data from sub-array 0 and sub-array 1 are either loaded
into the register ?les or Written to main memory via external
interface. For register loads, data are multiplexed With
corresponding sub-array 0 and sub-array 1 data from the
other cache bank by multiplexer 480 and 485. Select bank
signal (bit 5) passes data from the appropriate bank to load
aligner 428a and 428b, respectively, and into multiplexer
486. Cache hit signals, generated from tag check logic,
sWap doubleWords Within its quadWord accesses.
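The alternation just described can be captured by a single parity rule: doubleword d of way w resides in array (d XOR w) mod 2. A one-line C sketch of the FIG. 5 arrangement:

    /* Which data array holds doubleword dw of way w (FIG. 5 pattern):
       even doublewords of way 0 sit in Array 0, odd ones in Array 1,
       and the assignment is reversed for way 1.  Hence a CPU access
       (same dw, both ways) and an external access (adjacent dws, one
       way) each touch the two arrays once, never the same array twice. */
    static unsigned array_of(unsigned dw, unsigned way) {
        return (dw ^ way) & 1;
    }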
Sense amplifiers, in comparison to RAM cells, are relatively large. As such, the density of the cache arrays is limited by the width of the sense amplifiers. By employing multiplexers, the number of sense amplifiers is reduced by a factor of four, allowing the RAM cells to be more densely packed. The benefits which result are twofold. First, by being able to locate the cells in closer proximity to each other, the propagation delay of the array is decreased. Second, the significant savings in chip space can be used for decreasing the die size to reduce cost, and/or designing more aggressive functional units to effectively increase CPU performance.

Referring to FIG. 6, the 576 cells of each line are arranged in a pattern which minimizes the wiring in the load and store aligners. At the high level, bits are interlaced by their