RAID architecture with two
US006353895B1
(12)
United States Patent
(10) Patent No.:
(45) Date of Patent:
Stephenson
(54)
RAID ARCHITECTURE WITH TWO-DRIVE
FAULT TOLERANCE
(75)
Inventor: Dale J. Stephenson, Tracy, CA (US)
6,223,323 B1 *
Notice:
Mar. 5, 2002
4/2001 Wescott
OTHER PUBLICATIONS
M. Blaum, J. Brady, J. Bruck, and J. Menoa, “EVENODD:
An Ef?cient Scheme for Tolerating Double Disk Failures in
(73) Assignee: Adaptec, Inc., Milpitas, CA (US)
(*)
US 6,353,895 B1
RAID Architectures, ” 4—94, IEEE Transaction on Comput
ers, vol. 44, No. 2.
Subject to any disclaimer, the term of this
patent is extended or adjusted under 35
* cited by examiner
U.S.C. 154(b) by 0 days.
Primary Examiner—Gopal C. Ray
(21) Appl. No.: 09/250,657
Feb. 16, 1999
(22) Filed:
(74) Attorney, Agent, or Firm—Martine & Penilla, LLP
(57)
ABSTRACT
A two-dimensional parity arrangement that provides tWo
Related US. Application Data
(60)
Provisional application No. 60/075,273, ?led on Feb. 19,
drive fault tolerance in a RAID system is presented. The
1998.
parity arrangement uses simple exclusive-or (XOR) parity
(51)
Int. Cl.7 ....................... .. G06F 11/00; G06F 17/30;
(52)
(58)
US. Cl. .............................. .. 714/5; 714/6; 711/114
Field of Search
..................... .. 714/6, 7, 767,
codes rather than the more complex Reed-Solomon codes
used in a conventional RAID 6 implementation. User data
on the physical disk drives in the RAID system is arranged
into XOR roW parity sets and XOR column parity sets. The
XOR parity sets are distributed across the physical disk
drives by arranging the parity sets such that the data on each
G11B 5/00
714/770, 800, 5; 711/114, 100, 111; 707/202;
709/214
physical drive exists in tWo separate parity sets, With no
stripe unit in the same tWo sets. The storage lost due to parity
is equal to the capacity of tWo drives, or 2/N the total
capacity of an N-drive array. Accordingly, this parity
References Cited
(56)
US. PATENT DOCUMENTS
5446 855 A *
8/1995 D
t 1
arrangement uses less storage than mirroring When the
5’774’ 6 41 A * 6/1998 lsiilngl it 21'
number of total drives is greater than four.
6,138,125 A * 10/2000 DeMoss
6,219,800 B1 * 4/2001 Johnson et al.
11 Claims, 6 Drawing Sheets
1500*
START
I
60h!
A 2amcaiz'astztwacrs,at“
I
602x
LOOP FOR ALL TWO DRIVE FAILURE
COMBINATIONS. STEPS (603*610)
CONSTRUCT AN ARRAY FOR EACH FAILED
DISK WITH A FIELD FOR EACH ROWv
SET EACH FIELD TO FALSE [0]
I
FOR EACH PARITY (COLUMN OR ROW) UNIT ON AN
AFFECTED DRIVE SET THE FIELD TO TRUE.
I
506
12W)
507/
605
609/
LOOP: FOR EACH UNAFFECTED PARITY SET (FROM STEP EOI), LOOK FOR A
PARITY SET MEMBER ON THE OTHER DRIVE, IF THE PARITY SET MEMBER IS ALREADY
TRUE OR IS A PARITY UNIT, GO ON TO THE NEXT PARITV SET. OTHERWISE, DO STEP
505. WHEN ALL PARITY SETS HAVE BEEN DONE, GO TO STEP 607
I
MARK THE ROW FOR THE PARITY SET MEMBER TRUE, THE DISK
BLOCK JUST MARI\ED TRUE ALSO EIELONGS TO ANOTHER PARITY SET. LOOK
FOR ANOTHER MEMBER OF THAT PARITY SET ON THE OTHER DRIVE, IF THERE
AREN'T ANY, OR THAT MEMBER IS A PARITV BLOCK, OR THAT MEMBER IS
ALREADY TRUE, GO BACK TO STEP 605. OTHERWISE, REPEAT STEP 606
I
CONTINUE
I
CHECK THROUGH THE ROWS FOR ANY ROWS THAT ARE STILL FALSE.
IF THERE AREN'T ANY, GO BACK TO STEP 602
I
CHOOSE AN UNMARKED ROW TO BEGIN CONSTRUCTING A DEPENDENCY
CHAIN, AND ADD THE STRIPE UNIT IT REPRESENTS TO A DEPENDENCY LIST.
I
IF AN UNMARKED ROW IN THE OTHER DRIVE SHARES A PARITY SET OR ROW
WITH THE CURRENT UNMARKED ROW, MARK IT TRUE, MAKE IT CURRENT, AND ADD ITS
STRIPE UNIT TO THE DEPENDENCY LIST, REPEAT STEP 609 AS NECESSARY
I
INCREMENT THE NUMBER OF DEPENDENCIES. IF UNMARKED ROWS STILL
6/0
EXIST, GO BACK TO 505 AND START A NEW CHAIN,
I
CONTINUE
IF A STRIPE UNIT HAS A DEPENDENCY VALUE OF 0x09, IT BELONGS TO
DEPENDENCY CHAINS ou<<o> AND 3(I<<3)
I
EXIT
I
IE
I
U.S. Patent
Mar. 5,2002
Sheet 1 0f 6
US 6,353,895 B1
//02
COMPUTER
706
SYSTEM
I
DISK
707
CONTROLLER
704’
.
U.S. Patent
Mar. 5,2002
Sheet 2 0f 6
US 6,353,895 B1
mg.
AQ N
#5 ; [email protected]
MSW
NQ/
N
WGmD
NHQ
U.S. Patent
mmEm
MQFC
Mar. 5,2002
xO IE
Tm
x0 4 Nan
Sheet 3 0f 6
/
m/m
/
)
/
/
/
V35E56PS05
Q68
mFw Ew
US 6,353,895 B1
3
gm
MJQ
m:5MSm5&5
U.S. Patent
Mar. 5,2002
Sheet 4 0f 6
US 6,353,895 B1
@ M
FIND A RARITY
ARRANGEMENT FOR AN N—DRIVE
ARRAY SATISFYING CRITERIA 1
THROUGH 3
l
/\404
ANALYZE THE PARITY
ARRANGEMENT TO FIND THE
NUMBER OF UNRESOLVED
DEPENDENCIES
J
K406
ANALYZE THE ARRANGEMENTS
WITH ZERO UNRESOLVED DEPENDENCIES
TO FIND THE ONE WITH LEAST
RECONSTRUCTION OVERHEAD
W634
U.S. Patent
Mar. 5,2002
Sheet 5 0f 6
US 6,353,895 B1
F/GZ 5
MAKE THE FIRST STRIPE UNIT THE CURRENT BLOCK
I
>
502
INCREMENT THE PARITY SET OF THE CURRENT BLOCK
I
505
IF THE PARITY SET EXCEEDS THE NUMBER
OF DRIVES, SKIP TO STEP 511
I
II
504
IF THE PARITY SET MATCHES ANY OTHER STRIPE
UNITS IN THE ROW, GO TO STEP 502
I
II
505
IF THE PARITY SET MATCHES ANY OTHER STRIPE UNITS
IN THE COLUMN [DRIVE], GO TO STEP 502
I
II
.505
IF THE PARITY UNIT FOR THIS PARITY SET IS ON
THE SAME COLUMN [DRIVE], 60 TO STEP 502
I
507
THE PARITY SET DOES NOT CONFLICT,
SO INCREMENT THE CURRENT BLOCK
I
505
IF THE CURRENT BLOCK STILL REPRESENTS
A VALID STRIPE UNIT, SKIP TO STEP 510
I
509
THE PARITY SET IS A COMPLETE BIrI'IyAWQW-IZANALYZE
THEeU-HAMQWAI AND SET THE cURRENT BLOCK TO
THE VALUE RETURNED FROM THE ANALYSIS FUNCTION.
I
II
5/0
GO BACK TO STEP 502
5//
SET THE PARITY SET FOR THE
CURRENT BLOCK TO ZERO
I
II
5/2
DECREMENT THE CURRENT BLOCK
I
II
.573’
IF THE CURRENT BLOCK IS STILL A
VALID STRIPE UNIT, GO BACK TO STEP 502
I
ALL COMBINATIONS HAVE
BEEN CONSIDERED. END
5/4
U.S. Patent
Mar. 5,2002
Sheet 6 6f 6
US 6,353,895 B1
START
I
A MATRIX IS CONSTRUCTED WITH THE PARITY
50/“
502
SETS NOT AFFECTED FoR EACH DRIvE.
I
'\
LOOP FOR ALL TWO DRIVE FAILURE
' ’
COMBINATIONS, STEPS (603-610)
I
60.3-
CONSTRUCT AN ARRAY FOR EACH FAILED
DISK WITH A FIELD FOR EACH ROW,
SET EACH FIELD TO FALSE [0]
I
FoR EACH PARITY (COLUMN OR ROW) UNIT ON AN
504'“\
AFFECTED DRIvE SET THE FIELD TO TRUE.
I
605~\
LOOP: FOR EACH UNAFFECTED PARITY SET (FROM STEP 60-1), LOOK FOR A
PARITY SET MEMBER ON THE OTHER DRIVE. IF THE PARITY SET MEMBER IS ALREADY __
TRUE OR IS A PARITY UNIT, GO ON TO THE NEXT PARITY SET. OTHERWISE, DO STEP
606. WHEN ALL PARITY SETS HAVE BEEN DONE, GO TO STEP 607
I
MARK THE ROW FOR THE PARITY SET MEMBER TRUE, THE DISK
BLOCK JUST MARKED TRUE ALSO BELONGS TO ANOTHER PARITY SET. LOOK
FOR ANOTHER MEMBER OF THAT PARITY SET ON THE OTHER DRIVE. IF THERE
AREN'T ANY, OR THAT MEMBER IS A PARITY BLOCK, OR THAT MEMBER IS
ALREADY TRUE, GO BACK TO STEP 605. OTHERWISE, REPEAT STEP 606
I
CONTINUE
607_
\
CHECK THROUGH THE ROWS FOR ANY ROWS THAT ARE STILL FALSE.
IF THERE AREN'T ANY, GO BACK TO STEP 602
I
6
_
05 \
CHOOSE AN UNMARKED ROW TO BEGIN CONSTRUCTING A DEPENDENCY
CHAIN, AND ADD THE STRIPE UNIT IT REPRESENTS TO A DEPENDENCY LIST.
I
509»
IF AN UNMARKED ROW IN THE OTHER DRIVE SHARES A PARITY SET OR ROW
WITH THE CURRENT UNMARKED ROW, MARK IT TRUE, MAKE IT CURRENT, AND ADD ITS
STRIPE UNIT TO THE DEPENDENCY LIST, REPEAT STEP 609 AS NECESSARY
I
570'
INCREMENT THE NUMBER OF DEPENDENCIES, IF UNMARKED ROWS STILL
EXIST, GO BACK TO 608 AND START A NEW CHAIN,
I
622‘
6'77
—\
CONTINUE
IF A STRIPE UNIT HAS A DEPENDENCY VALUE OF 0x09, IT BELONGS TO
DEPENDENCY CHAINS 0(1<<o) AND 3(1<<3)
I
672
\
EXIT
.I
US 6,353,895 B1
1
2
RAID ARCHITECTURE WITH TWO-DRIVE
FAULT TOLERANCE
or (XOR) results of all data blocks in the parity disks roW.
The Write bottleneck is reduced because parity Write opera
tions are distributed across multiple disks.
CROSS REFERENCE TO RELATED
APPLICATIONS
The RAID 6 architecture is similar to RAID 5, but RAID
6 can overcome the failure of any tWo disks by using an
additional parity block for each roW (for a storage loss of
The present application claims priority bene?t of US.
Provisional Application No. 60/075,273, ?led Feb. 19, 1998.
2/N). The ?rst parity block (P) is calculated With XOR of the
data blocks. The second parity block (Q) employs Reed
BACKGROUND OF THE INVENTION
1. Field of the Invention
The disclosed invention relates to architectures for arrays
of disk drives, and more particularly, to disk array architec
tures that provide tWo-drive fault tolerance.
2. Description of the Related Art
Solomon codes.
RAID 6 provides for recovery from a tWo-drive failure,
but at a penalty in cost and complexity of the array controller
because the Reed-Solomon codes are complex and may
require signi?cant computational resources. The complexity
15
A Redundant Array of Independent Disks (RAID) is a
storage technology Wherein a collection of multiple disk
of Reed-Solomon codes may preclude the use of such codes
in softWare and may necessitate the use of expensive special
purpose hardWare. Thus, implementation of Reed-Solomon
codes in a disk array increases the cost and complexity of the
drives is organized into a disk array managed by a common
array controller. The array controller presents the array to the
array. Unlike the simpler XOR codes, Reed-Solomon codes
cannot easily be distributed among dedicated XOR proces
user as one or more virtual disks. Disk arrays are the
sors.
frameWork to Which RAID functionality is added in func
tional levels to produce cost-effective, highly available,
high-performance disk systems.
SUMMARY OF THE INVENTION
RAID level 0 is a performance-oriented striped data
mapping technique. Uniformly siZed blocks of storage are
25
(rather than Reed-Solomon codes). The XOR parity stripe
assigned in a regular sequence to all of the disks in the array.
units are distributed across the member disks in the array by
RAID 0 provides high I/O performance at loW cost. Reli
separating parity stripe units from data stripe units. In one
ability of a RAID 0 system is less than that of a single disk
embodiment, the number of data stripe units is the same as
drive because failure of any one of the drives in the array can
result in a loss of data.
the square of tWo less than the number of drives (i.e., (N—2
* N—2)). Each data stripe unit is a member of tWo separate
parity sets, With no tWo data stripe units sharing the same
RAID level 1, also called mirroring, provides simplicity
and a high level of data availability. A mirrored array
tWo parity sets. Advantageously, the storage loss to parity
includes tWo or more disks Wherein each disk contains an
stripe units is equal to the sum of the dimensions, so this
identical image of the data. A RAID level 1 array may use
parallel access for high data transfer rates When reading.
35
RAID 1 provides good data reliability and improves perfor
high cost.
RAID level 2 is a parallel mapping and protection tech
tolerance. The array includes tWo or more disk drives and a
disk controller. Data recovery from a one or tWo drive failure
nique that employs error correction codes (ECC) as a
correction scheme, but is considered unnecessary because
off-the-shelf drives come With ECC data protection. For this
a result, RAID 2 is rarely used.
RAID level 3 adds redundant information in the form of
parity arrangement uses less storage than mirroring When the
number of total drives is greater than four.
One embodiment includes a redundant array of indepen
dent disk drives that provides one-drive and tWo-drive fault
mance for read-intensive applications, but at a relatively
reason, RAID 2 has no current practical use, and the same
performance can be achieved by RAID 3 at a loWer cost. As
The present invention solves these and other problems by
providing tWo-drive fault tolerance using simple XOR codes
is accomplished by using a tWo-dimensional XOR parity
arrangement. The controller is con?gured to calculate roW
XOR parity sets and column XOR parity sets, and to
45
parity data to a parallel accessed striped array, permitting
distribute the parity sets across the disks drives in the array.
The parity sets are arranged in the array such that no data
block on any of the disk drives exists in tWo roW parity sets
or tWo column parity sets. In one embodiment, the controller
is con?gured to reduce reconstruction interdependencies
regeneration and rebuilding of lost data in the event of a
betWeen disk blocks.
single-disk failure. One stripe unit of parity protects corre
sponding stripe units of data on the remaining disks. RAID
BRIEF DESCRIPTION OF THE DRAWINGS
3 provides high data transfer rates and high data availability.
Moreover, the cost of RAID 3 is loWer than the cost of
mirroring since there is less redundancy in the stored data.
RAID level 4 uses parity concentrated on a single disk to 55
alloW error correction in the event of a single drive failure
(as in RAID 3). Unlike RAID 3, hoWever, member disks in
The advantages and features of the disclosed invention
Will readily be appreciated by persons skilled in the art from
the folloWing detailed description When read in conjunction
With the draWings listed beloW.
FIG. 1 is a hardWare block diagram shoWing attachment
a RAID 4 array are independently accessible. Thus RAID 4
is more suited to transaction processing environments
of one or more disk drives to a computer system.
involving short ?le transfers. RAID 4 and RAID 3 both have
a Write bottleneck associated With the parity disk, because
shoWing mapping of one or more physical disk drives to one
or more logical drives.
every Write operation modi?es the parity disk.
FIG. 3 is a logical block diagram shoWing data striping,
Wherein each logic block is equivalent to a stripe unit.
FIG. 2 is a logical block diagram of a disk array system
In RAID 5, parity data is distributed across some or all of
the member disks in the array. Thus, the RAID 5 architecture
achieves performance by striping data blocks among N
disks, and achieves fault-tolerance by using 1/N of its
storage for parity blocks, calculated by taking the exclusive
65
FIG. 4 is a ?oWchart shoWing an overvieW of the design
process.
FIG. 5 is a ?oWchart shoWing the processes steps of
?nding a column parity set.
US 6,353,895 B1
3
4
FIG. 6 is a ?owchart showing the processes steps of
analyzing a parity set to ?nd dependencies.
physical blocks 1.1, 2.1, and 3.1. A third stripe, stripe 3,
comprises physical blocks 1.3, 2.3, and 3.3. Logical blocks
In the drawings, the ?rst digit of any three-digit number
generally indicates the number of the ?gure in which the
element ?rst appears. Where four-digit reference numbers
are used, the ?rst two digits indicate the ?gure number.
0—2 are mapped into stripe 1 and logical blocks 6—8 are
mapped into stripe 3.
In many cases a user accessing data from the logical disks
will access the logical disk blocks consecutively. The stripe
mapping shown in FIG. 3 maps consecutive logical blocks
DETAILED DESCRIPTION OF THE
PREFERRED EMBODIMENT
FIG. 1 is a hardware block diagram showing attachment
10
performance because the disk operations will tend to be
of one or more disk drives to a computer system. In FIG. 1,
a disk controller 104 is attached to a computer system 102.
One or more disk drives 106—107 are provided to the
more uniformly distributed across all of the available disk
drives.
controller 104. Typically, the disk controller communicates
with a low level software program, known as a device driver,
15
running on the computer system 102. The device driver
controls the operation of the disk controller 104 and directs
the controller 104 to read and write data on the disks
106—107. As is well known, there may be more than one disk
controller 104 that may either be external to or part of the
tem such as RAID 3 and RAID 4, the array controller 208
106—107.
The present invention provides a parity arrangement
FIG. 2 is a logical block diagram of a disk array system
showing mapping of the disk drives 106—107 in an array 210
of the physical drives 106—107 to the logical drives 209 is
provide by an array controller 208 which may be imple
mented in hardware, software, or both.
The array controller 208 maps the physical drives
106—107 into logical disks 204—205 such that a computer
user 202 only “sees” the logical disks 204—205 rather than
the physical drives 106—107. The number of physical drives
106—107, and the siZe of the physical drives 106—107 may
be changed without affecting the number and siZe of the
logical drives 204—205. Several physical drives 106—107
may be mapped into a single logical drive. Conversely, one
of the physical drives 106—107 may be mapped into several
logical drives. In addition to mapping physical drives
106—107 to logical drives 204—205, the array controller
provides data striping of the data on the physical drives
25
(including the parity unit) are spread across different physi
35
Parity data is provided in an N-by-N parity map within the
array 210, where N is the number of physical drives, and the
storage capacity is equal to N-2 drives. One parity set
includes the stripe units on a given row (row parity), while
its complementary parity set is a column (column parity)
45
drawn from N-2 different rows (and also N-2 different
drives). The stripe units are also distributed in such a manner
that they can be striped. An example for a our drive array
having four stripes per drive (four-by-four) is shown in
Table 1. The data on each of the four drives is shown in
columns one through four. The data in each of the four
stripes is shown in rows one through four. The four-by-four
arrangement result in sixteen blocks, as shown. There are
of the physical disks 106—107 actually receives the data. In
eight blocks of actual user data, and eight parity blocks.
Each data block has a physical location (i.e., its physical
location on a disk drive) and a logical position (its position
in the two-dimensional parity arrangement). Each data block
order to balance I/O loads across the drives, the array
controller will often map consecutive logical blocks across
several physical drives, as shown in FIG. 3.
FIG. 3 shows an address mapping scheme known as disk
55
drives are mapped into units known as stripes. For
convenience, the present disclosure treats each stripe unit as
is a member of two parity sets, a row parity set and a column
parity set. Letters are used to denote the row parity for a data
block and numbers are used to denote column parity for a
data block. Parity blocks contain no user information, but
rather, only parity information. Each parity block is a
having only one block, with the understanding that a stripe
member of only one parity set, either a row parity set or a
may contain multiple blocks. FIG. 3 shows three member
column parity set. In Table 1, parity blocks are shown in
drives 301—303 in a disk array. Each member drive has three
parentheses.
physical disk blocks (a typical real-world disk drive would
labeled 3.1, 3.2, and 3.3. A ?rst stripe, stripe 1, includes
cal drives. Fourth, data is available after failure of any two
of the physical drives 106—107.
writes data to logical block 3, the user will not know which
have tens of thousands of blocks). The physical blocks on
member disk one 301 are labeled 1.1, 1.2, and 1.3. The
physical blocks on member disk two 302 are labeled 2.1, 2.2,
and 2.3. The physical blocks on member disk three 301 are
tolerance is provided using simple exclusive-or (XOR)
parity processing and also using 2/N of the physical drive
space for parity encoding. The two-drive XOR parity
arrangement can be described in terms of four criteria as
The array controller 208 maps data address on the physi
striping, wherein physical address blocks having the same
physical address but residing, on different physical disk
whereby the array controller 208 can correct for failure of
any two of the physical drives 106—107. Two-drive fault
follows. First, each stripe unit in the physical drives is a
member of two different parity sets. Second, different stripe
units have do not have common membership in both parity
sets with another stripe unit. Third, members of a parity set
106—107, and the array controller 208 corrects errors due to
the failure of one or more of the physical drives 106—107.
cal drives 106—107 into logical address in the logical disks
204—205. Logical addresses are typically described in terms
of logical blocks, numbered consecutively from 0 to N.
Typically, the user 202 does not know how logical addresses
map to physical addresses. Thus, for example, if the user 202
The extent to which the array controller 208 can correct
for multiple drive failures depends, in part, on the redun
dancy and/or parity (i.e., error correction) data stored on the
physical drives 106—107. In a single dimension parity sys
can correct errors due to failure of one of the physical disks
computer system 102.
into one or more logical disk drives 204—205. The mapping
across different disk drives. Thus a user accessing logical
blocks in a consecutive fashion will see improved I/O
65
For example, in Table 1, the block A2 is a data block
containing user data. Physically, the block A2 resides in the
?rst stripe unit on the second drive. Logically, the block A2
is a member of the row parity set A, and is also a member
of the column parity set 2.
US 6,353,895 B1
6
Table 2 shoWs a ?ve drive arrangement Wherein all
dependencies can be resolved.
TABLE 1
Stripe
Stripe
Stripe
Stripe
1
2
2
4
Drive 1
Drive 2
Drive 3
Drive 4
A3
(2)
C1
(4)
A2
(B)
C4
(D)
(1)
B4
(3)
D2
(A)
B1
(C)
D3
The arrangement shown in Table 1 visually ?ts the ?rst
three criteria. First, each stripe unit (user data block) is a
member of tWo different parity sets. Second, different stripe
TABLE 2
A2
B1
(c)
D4
(5)
A5
(B)
c3
(4)
E1
(1)
B3
c2
(D)
E5
(A)
B2
(3)
D5
E4
1O
With larger sets, involving more than four drives, it is
possible to construct parity arrangements that satisfy the ?rst
three criteria, but that have circular (unresolvable) depen
dencies. A parity arrangement With circular dependencies
units do not have common membership in both parity sets
With another stripe unit. Thus, for example, there is only one
block A2. Third, members of a parity set (including the
parity unit) are spread across different physical drives. For
example, the column parity set 1 is spread across drives 1,
3, and 4, the roW parity setAis spread across drives 1, 2, and
A1
(2)
c4
D3
(E)
15
Will have some data blocks that cannot be reconstructed after
a tWo-drive failure. Consider, for example, the six drive
arrangement shoWn in Table 3.
4.
TABLE 3
With regards to the fourth criteria, for this simple
arrangement, there are 48 different stripe-unit/drive combi
drive
nations to consider (eight different stripe units, six possible
tWo-drive failure combinations). Forty of these can be
handled by using surviving members, While eight have
dependencies that require the reconstruction of another
stripe unit.
25
Within an XOR parity set (either roW or column) the value
of any block in the set is computed as simply the XOR
1
2
3
4
5
6
A2
B5
(3)
D1
A3
B4
(C)
D2
A4
(2)
C1
D6
A5
(B)
C2
D3
(1)
B3
C6
(4)
(A)
BI
C5
(D)
E4
(6)
E6
(5)
F3
F1
E2
F5
E3
F4
(denoted by the symbol “GB”) of all of the other blocks in the
set. Thus, for example, Table 1 shoWs a roW parity set “A”
If the 4th and 6th drives failed, stripe units (blocks) D3
having members (A), A2 and A3. (Note that the block A2 is
and E3 Would be unrecoverable. Neither D3 nor E3 can be
reconstructed by use of the roW parity groups, since roW
also a member of the column parity set “2”, and the block
A3 is also a member of the column parity set “3”). The
blocks A2 and A3 contain actual user data. The block (A) is
the parity block. Thus, the folloWing relationships are all
valid:
parity units (D) and
are on the failed drives. S0 D3 and
E3 both Would need to be reconstructed by use of the column
35
parity set
Recall that any one member of a parity set can
be reconstructed from the other members of the set. If tWo
members of a parity set are missing, then the set cannot be
reconstructed. Both D3 and E3 are members of the same
column parity set, set
When the user Writes data to a disk block, the parity
blocks corresponding to that disk block are recomputed.
Thus, for example, if the user Writes neW data to the block
A2, then the value of the roW parity block (A) is recomputed
as (A)=A2G9A3 and stored, and the value of the column
parity block (2) is recomputed as (2)=A2G9D2 and stored.
With the values of (A) and (2) computed and saved as
Thus reconstruction of D3 from
the column parity set (3) requires that E3 be reconstructed
?rst (and vice versa). Thus, D3 and E3 cannot be recon
structed.
Constructing the Parity Arrangement
As shoWn in Table 3 above, there are many possible
45
arrangements (schemes) of data and parity blocks. Although
it is possible to construct parity sets that have circular
dependencies, it is also possible to construct parity sets that
have no circular dependencies. Clearly, the most desirable
arrangements are arrangements that have no circular depen
dencies.
Even When an arrangement has no circular dependencies,
there may be interdependencies (as in the case above Where
above, then the value of A2 can be reconstructed if needed.
If drive 2 (the drive containing A2) should fail, the value of
A2 can be reconstructed from either of the folloWing rela
tionships:
If, however, the ?rst tWo drives (drives 1 and 2) in Table
1 fail, then both (2) and A3 are unavailable, since (2) is
in a tWo-drive failure, A2 Was dependent on
Interde
pendencies create additional overhead When a block must be
stored on drive 1 and A3 is stored on drive 2. As shoWn in 55 reconstructed. Thus, the most ef?cient parity arrangements
are those arrangements that provide the loWest reconstruc
the above equations, at least one of the value (2) or A3 is
needed to reconstruct A2. Fortunately, A3 can be recon
tion overhead (i.e., the arrangements that have the feWest
structed from A3=(3) G9D3, because (3) is stored on drive 3
interdependencies).
and D3 is on drive 4. Thus, A2 is dependent on A3 to survive
this particular tWo-drive failure. If both drive 1 and drive 2
FIG. 4 is an overvieW ?oWchart of the identi?cation
process. The process shoWn in FIG. 4 begins at a ?nd
fail, A2 can be reconstructed by calculating A3=(3)G9D3
process block 402, Which includes ?nding a parity arrange
folloWed by A2=(A)G9A3.
ment for an N-drive array satisfying the ?rst three criteria
above. The process then advances to a ?rst analysis block
All of the dependencies in the four drive arrangement
shoWn in Table 1 can be resolved. Thus, the failure of any
tWo drives in Table 1 Will not result in the loss of data
because the data in all of the blocks on the failed disks can
be reconstructed.
65
404 Where the parity arrangement is analyZed to ?nd the
number of unresolved dependencies. The process then
advances to a second analysis block 406 Where the arrange
ments Zero unresolved dependencies (found in process block
US 6,353,895 B1
8
7
404) are analyzed to ?nd a parity arrangement With the
lowest reconstruction overhead.
In the process block 511, the process sets the parity set for
the current block to Zero and advances to a process block
In the ?nd process block 402, the process declares an
512. In the process block 512, the process decrements the
integer matrix With siZe N><N. It is assumed that the stripe
current block and advances to a process block 513. In the
units should be striped, and also that each roW Will contain
both a roW parity unit and a column parity unit. Furthermore,
it is assumed that all stripe units in a given roW Will comprise
the parity set for that roW. So the process begins by initial
process block 513, if the current block is still a valid stripe
unit, then the process jumps back to the process block 502;
otherWise, the process advances to a process block 514.
When the process reaches the process block 514, all of the
possible combinations have been considered, and the pro
iZing the matrix according to the folloWing pattern (example
is a 6x6 array) as shoWn in Table 4.
10 cess exits.
An optional de?ne can be used to insert a step 4a—If the
TABLE 4
0
0
R
0
0
R
0
0
c
0
0
c
0
R
0
0
R
0
0
c
0
0
c
0
R
0
0
R
0
0
c
0
0
c
0
0
In Table 4, a value of 0 is a stripe unit not yet assigned to
a column set, R represents the roW parity unit, and C
represents a column parity set unit (internally, R and C are
set to 0x80 and 0x40 respectively). Each C is associated
With a parity set equal to its roW, the C in the ?rst roW
belongs to set 1, the C in the second roW belongs to set 2,
etc. If the roWs Were counted from 0, the unassigned blocks
Would be set to —1 for this algorithm.
15
parity block for this parity set is in this roW, go to step 2. This
is not a logical requirement, but can reduce the number of
combinations considered. Another optional de?ne can be
used to ?ll a number of blocks (generally the stripe units in
the ?rst roW) With assigned parity sets, and terminating the
program When it makes its Way back to that level.
FIG. 6 is a ?oWchart shoWing the steps of analyZing the
arrangement. After ?nding a parity arrangement that does
not violate the ?rst three criteria the parity arrangement is
analyZed for unresolvable (circular) tWo-disk failure depen
dencies. The ?oWchart in FIG. 6 begins at a process block
601 Where a matrix is constructed With the parity sets not
affected for each drive (every drive Will have tWo parity sets,
25
either column or roW, that do not appear on the drive). The
process then advances to a loop block 602. The loop block
In this example, the roW parity alWays precedes the
column parity. An optional de?nition alloWs the order of R
602 provides tWo nested loops to iterate through each
tWo-drive failure combination. A loop counter failil iter
and C to alternate Within a parity arrangement. If an R and
ates from 0 to N-2, and a loop counter faili2 iterates from
faili1+1 to N-l. The ?rst process block inside the loop is
C can share the same column (alWays the case With an odd
number of drives), the sets they are produced from can have
no stripe units in common (the program maintains a list of
bad roW/column combinations to make sure the rule is not
a process block 603 Where an array is constructed for each
violated).
The program then proceeds through the array according to
35
the ?oWchart shoWn in FIG. 5, beginning at a process block
501. In the process block 501 the process sets the ?rst stripe
failed disk, With a ?eld for each roW. Each ?eld is initially
set to 0 (false) to indicate that a stripe unit can be recon
structed.
The process then advances to a process block 604 Where,
for each parity (column or roW) unit on an affected drive, the
roW is set to 1 (true). The process then advances to a process
unit as the current block and then advances to a process
block 605, Where for each unaffected parity set (from the
block 502. In the process block 502, the process increments
process block 601), the process looks for a parity set
member on the other drive. If the parity set member is
already true, or is a parity unit, then the next parity set is
checked; otherWise, the process advances to a process block
606. When all parity sets have been checked, the process
the parity set of the current block and advances to a process
block 503. In the process block 503, if the parity set exceed
s the number of drives, then the process jumps forWard to a
process block 511; otherWise, the process advances to a
process block 504. In the process block 504, if the parity set
matches any other stripe units in the roW, then the process
returns to the process block 502, otherWise, the process
advances to a process block 607.
45
advances to a process block 505.
In the process block 505, if the parity set matches any
other stripe units in the column (drive), then the process
returns to the process block 502, otherWise, the process
advances to a process block 506. In the process block 506,
if the parity unit for this parity set is on the same column
(drive), then the process returns to the process block 502;
otherWise, the process advances to a process block 507. In
the process block 507, it is assumed that the parity set does
not con?ict, and the current block is incremented and the
In the process block 606, the roW for the parity set
member is marked (set) to true. The disk block just marked
true also belongs to another parity set. The process block
then looks for another member of that parity set on the other
drive (this is folloWing a resolvable dependency). If there are
none, or that member is a parity block, or that member is
already true, then the process jumps back to the process
block 605; otherWise the process repeats the process block
606.
By the time the process reaches the process block 607, the
55
process has identi?ed all is the blocks that can be recon
structed from the particular tWo-drive failure indicated by
block 508, if the current block still represents a valid stripe
unit, then the process jumps to a process block 510;
failil and faili2. The process must noW search for blocks
that cannot be reconstructed. The process ?rst checks
through the roWs for any roWs that are still false. If there
otherWise, the process advances to a process block 509.
roWs that are false, the process advances to a process block
Upon reaching the process block 509, the process has
identi?ed a complete parity arrangement. In the process
block 509, the process performs the analysis shoWn in
608; otherWise, the process jumps back to the process block
process advances to a process block 508. In the process
602.
If the process reaches process block 608, it means that
there is a block that cannot be reconstructed (process block
connection With FIG. 6 and sets the current block to the
value returned from the analysis function. The process then
advances to the process block 510. In the process block 510,
the process jumps back to the process block 502.
65
606 already provided the blocks that can be reconstructed).
Thus, the dependency chains identify members of circular
dependencies, Which are used to shorten the searching
US 6,353,895 B1
10
By counting the number of reconstructions, it is possible
to identify the best Zero-dependency schemes.
process. To ?nd dependencies, the process chooses an
unmarked (still false) roW to begin constructing a depen
dency chain. The stripe unit represented by the chosen roW
Arrangements
is added to a dependency list. The process then advances to
The procedures listed above in connection With FIGS. 4,
a process block 609.
5, and 6 identify many schemes With no circular or unre
In the process block 609, if an unmarked roW in the other
failed drive shares a parity set or roW With the current
unmarked roW, the process marks the roW true, makes it
current, and adds its stripe unit to the dependency list. The
process loops in the process block 609 through all unmarked
roWs and then advances to a process block 610.
10
In the process block 610, the process increments the
number of circular or unresolvable dependencies. If
unmarked roWs still eXist, the process jumps back to the
process block 608 and starts a neW chain; otherWise, the
process jumps back to the process block 602.
Aprocess block 622 is the end block for the nested loops
started in the process block 602. When the nested loops are
complete, the process advances to a process block 611.
When the process reaches the process block 611, all tWo
drive failure combinations have been evaluated and the
process has a list of blocks in a dependency for this
15
solved dependencies. In many cases, there are multiple
solutions. For eXample, for a siX drive array, there are 29,568
distinct schemes that meet the desired criteria. A four-drive
array is listed in Table 1 above.
There are siXteen distinct four-drive schemes With the
same parity unit placement and the same number of recon
structions. The average number of reconstructions per failed
stripe unit in a tWo-drive failure is 4/3, and the average
number of reconstructions per stripe-unit in a tWo-drive
failure is 2/3.
There are tWo alternate four-drive parity schemes, shoWn
in Tables 6 and 7, that evenly distribute column parity. Both
of these schemes have 2/3 reconstructions per stripe unit in
a tWo-drive failure, and 4/3 reconstructions per failed stripe
unit in a tWo-drive failure. The four-drive schemes offer no
capacity savings over a RAID 1 scheme, and are more
arrangement. In one embodiment, the blocks are stored as a
complex. HoWever, the four-drive schemes Will provide full
binary tag for each stripe unit. If a stripe unit has a
recovery after the loss of any tWo drives, Which RAID 1
cannot. The four drive array has 8 logical blocks. Table 8
dependency value of 09 (hexadecimal), it belongs to depen
dency chains 0 (1<<0) and 3 (1<<3). A stripe unit With a tag
25
of 0 has no unresolved dependencies. If this parity arrange
ment has feWer dependencies (or in a Zero-dependency case,
shoWs hoW physical to logical addresses are mapped (i.e.,
hoW the array is stripped) in the four drive arrangements.
TABLE 6
the same number) than any previous arrangement, the parity
arrangement and the dependencies are saved to disk. Upon
completion of the process block 611, the process advances
A1
(2)
c4
(D)
to an eXit block 612, and eXits.
The process block 612 returns a neW current block for the
A2
(B)
03
(4)
(1)
B2
(c)
D3
(A)
B1
(3)
D4
(1)
B3
(c)
D4
(A)
B2
(3)
D1
P
2
P
6
P
3
P
7
main iteration routine (all following stripe units are then
cleared). Since the process searches for an arrangement With
no dependencies, the process Will look at the highest block
number in each dependency, and select the loWest of these
highest blocks to return. Changing a later block in the
35
TABLE 7
A4
(2)
c3
(D)
arrangement Would not have the possibility of removing the
dependency. For the eXample siX drive parity arrangement
(D3/E3 dependency) described in connection With Table 3,
the analysis routine Would return 17, the logical stripe unit
A1
(B)
02
(4)
number of E3. Block numbers With Zero circular dependen
cies return the largest logical stripe unit number in the
arrangement, plus one.
Minimizing reconstruction
TABLE 8
0
P
4
P
45
The process block 606 of the above algorithm is more
than a method of identifying stripe units that may be
reconstructed. The process block 606 also identi?es the steps
1
P
5
P
A ?ve-drive (?ve-by-?ve) parity arrangement is shoWn in
Table 9. This arrangement has 2/3 reconstructions per stripe
required to actually recreate the stripe units. For any given
number of drives, it is desired to select an arrangement that
reduces the number of reconstruction involved in a drive
unit in a tWo-drive failure, and 5/3 reconstructions per failed
stripe unit in a tWo-drive failure. The ?ve-drive parity
arrangement is striped as shoWn in Table 10.
failure. The process block 606 maintains a counter to count
the number of reconstruction steps involved. Each iteration
in the process block 606 represents a reconstruction of a
55
stripe unit and a reconstruction that must occur to recon
struct another stripe unit. For each tWo-drive failure, the
number of reconstructions necessary is described by the
folloWing algorithm:
hop=(number of iterations reconstruction hops in the
process block 606)
TABLE 9
A2
B1
(c)
D4
(5)
A1
(2)
c4
D3
(E)
0
5
1
P
A5
(B)
c3
(4)
E1
(1)
B3
c2
(D)
E5
(A)
B2
(3)
D5
E4
P
3
P
4
total=(running count of reconstructions necessary for par
ity arrangement)
(after every step 6)
While (hop)
total+=hop--,
end While
TABLE 10
65
2
P
US 6,353,895 B1
P
10
P
11
12
TABLE 10-continued
TABLE 15 -continued
6
11
P
7
P
12
8
P
13
P
9
14
14
P
21
28
P
P
15
22
29
P
P
16
23
P
30
10
17
24
P
31
11
18
P
25
32
12
19
P
26
33
13
P
20
27
34
AsiX-drive (siX-by-siX) XOR parity arrangement is shown
in Table 11. Data blocks are striped as shoWn in Table 12.
This parity arrangement has 0.683 reconstructions per stripe
An eight drive (eight-by-eight) parity arrangement is
10
unit in a tWo-drive failure, and 2.05 reconstructions per
failed stripe unit in a tWo-drive failure. The parity algorithm
shoWn in Table 16. This arrangement averages 0.70 recon
structions per stripe unit in a tWo-drive failure, and 2.81
reconstructions per failed stripe unit in a tWo-drive failure.
found 246 schemes With the same reconstruction demands.
A slightly less efficient arrangement that evenly distributes
There are no Zero dependency schemes that stripe the
column parity sets across all columns. HoWever, a global 15 column parity is shoWn in Table 17. This parity arrangement
uses 0.72 reconstructions per stripe unit in a tWo-drive
striping effect can be instituted by reordering columns on the
failure, and 2.87 reconstructions per failed stripe unit in a
tWo-drive failure. An eight drive system is striped as shoWn
in Table 18.
disk. For example, every other siX-by-siX block could have
the alternate parity arrangement shoWn in Table 13 to
complement the arrangement shoWn in Table 11.
20
TABLE 11
A1
B4
(3)
D2
E5
(6)
A2
B5
(C)
D6
E3
(F)
A3
(2)
C6
D4
(5)
F1
TABLE 16
A4
(B)
C1
D5
(E)
F2
(1)
B2
c5
(4)
E6
F3
(A)
B6
c3
(D)
E1
F4
TABLE 12
A2
B6
C5
(4)
E1
F7
G3
(8)
25
1
2
3
P
P
6
7
P
P
4
5
P
12
18
P
P
13
19
P
8
14
P
20
9
15
P
21
10
P
16
22
11
P
17
23
(C)
D6
E3
(F)
A5
B7
(C)
D8
E4
F3
(G)
H1
A7
(2)
C4
D3
E8
(6)
G1
H5
A6
(B)
C8
D2
E7
(E)
G5
H3
(1)
B3
C7
D6
(5)
F2
G8
H4
(A)
B4
C1
D7
(E)
F8
G6
H2
TABLE 17
35
TABLE 13
A1
B4
A4
B8
(3)
D1
E2
F5
(7)
H6
30
0
A2
B5
A3
B5
C2
(D)
E6
F1
G4
(H)
A3
(2)
(A)
B6
(3)
D2
A4
(B)
C1
D5
C6
D4
C3
(D)
(1)
B2
C5
(4)
E5
(6)
(E)
F2
(5)
F1
E1
F4
E6
F3
A1
B2
C5
(4)
E8
F7
G6
A2
B3
C1
(D)
E7
F5
G4
A4
B1
(3)
D7
E6
F8
(G)
A3
B4
(C)
D2
E5
F6
(7)
A6
(2)
C4
D1
E3
(F)
G7
A8
(B)
C3
D4
E2
(6)
G5
(1)
B8
C6
D5
(E)
F4
G2
(A)
B7
C8
D6
(5)
F3
G1
(H)
(8)
H5
H1
H8
H7
H3
H2
40
TABLE 18
45
0
8
1
9
2
10
3
11
4
P
5
P
P
6
P
7
16
P
17
P
P
18
P
19
12
20
13
21
14
22
15
23
A seven drive (seven-by-seven) XOR parity arrangement
24
25
26
27
28
29
P
P
is shoWn in Table 14. The seven drive array is striped as
23
ii
3;
3;
3g
3:
2g
2;
P
P
42
43
44
45
46
47
shown in Table 15. This arrangement evenly stripes the
parity across all columns. It requires 2/3 reconstructions per
stripe unit in a tWo-drive failure, and 7/3 reconstructions per
failed stripe unit in a tWo-drive failure.
TABLE 14
50
Array Controller Programs
In one embodiment, the tWo-dimensional XOR parity
arrangements are implemented in array processor 208.
55 Operation of the array processor 208 is described in terms of
“programs” Written using a disk command programming
A3
A2
A1
A7
A6
(1)
(A)
2%
B1
(3)
B7
(C)
(2)
C5
(B)
B4
B3
language described beloW. The actual operations performed
(D)
E6
F5
D6
E5
F4
D5
E4
(6)
D4
E3
(F)
D3
(5)
F1
D2
(E)
F7
(4)
E7
F6
by the controller 208 are implemented in hardWare,
softWare, or both. The eXample programs beloW are directed
60 toWards a siX disk array using the parity arrangement shoWn
(7)
(G)
G2
G1
G7
G6
G5
C4
C3
C2
_
_
in Table 19 and the data striping (physical to logical address
mapping) shoWn in Table 20.
TABLE 15
0
7
1
8
2
9
3
P
TABLE 19
4
P
P
5
P
6
65
A2
B5
A3
B6
A4
(2)
A5
(B)
(1)
B3
(A)
B4
US 6,353,895 B1
14
Disk control program for the tWo-drive XOR parity
arrangement are provided beloW in three forms: (A) Without
any XOR on the drives; (B) With XOR on the drives but
TABLE 19-continued
(3)
D1
E4
(6)
(C)
D2
E1
(E)
C1
D6
(5)
F3
C2
D3
(E)
F4
C6
(4)
E2
F5
C5
(D)
E6
F1
Without third party commands (e.g., XDWrite Extended,
Regenerate, Rebuild); and, ?nally, (C) With XOR and third
party commands.
The programs beloW are described using the folloWing
syntax:
Code(Drive, Stripe Units, [Drive, Stripe Unit])
TABLE 20
0
6
P
12
18
P
1
7
P
13
19
P
2
P
8
14
P
20
3
P
9
15
P
21
10
P
4
10
P
16
22
P
5
11
P
17
23
parentheses represent a host command, and the square
brackets (if present) represent a secondary command asso
ciated With the XDWriteExtended, Rebuild, or Regenerate
commands. Host buffers are represented as double loWer
15
In one embodiment, the tWo-set XOR parity arrangement
is implemented in a disk array comprising a software
disk-driver and six SeagateTM BarracudaTM FibrechannelTM
drives. In addition to the normal read/Write primitives, the
FibrechannelTM drives support a class of special XOR
commands, Which perform XOR calculations in the disk
electronics, maintain a separate XOR result buffer, and can
interact directly With other drives. The special XOR com
mands help facilitate RAID 5 operations. In the present
application, the special XOR is commands are used to
provide tWo-drive fault tolerance. Conventional disk drives
Where Code is a command code listed in Table 20. The
case letters (e.g., aa, bb, cc, etc), and the buffer in a disk
drive is indicated by #drive# Where drive is an integer. Host
XOR commands have only buffers in their list. Comments
are indicated by a double slash “//”. The symbols “Q” and
“e” denote data transfer. Multiple operations may be coded
using one command code. For example, the commands to
Write A1 and A2 to drive 1 and Write A3 to drive 2 can be
coded as WR(1, A1, A2)WR(2, A3) or WR(1,A1, A2)(2,A3).
25
provide the folloWing primitive operations:
NeW data in a data stripe unit is indicated by an uppercase
initial letter, While old data is indicated With a loWercase
initial letter. Parity units are in bold, and preceded With a P
or p. For the purposes of the present description, and for
convenience, operations are described in terms of Whole
stripe units. HoWever, operations on less than Whole stripe
Read: Read block(s) from disk storage into host memory.
Write: Write block(s) from host memory into disk storage.
units are Within the scope of the present disclosure.
The special XOR commands provide the folloWing addi
TABLE 20
tional primitive operations:
Command Codes
XDWrite: Write block(s) from host memory into disk
storage. The disk performs an XOR betWeen the old and neW
contents of disk storage before committing the Write. The
Read
Write
35
XDWrite, destructive
XDWrite, non-destructive
results of the XOR are retained in an XOR buffer, and are
obtained by using an XDRead command. This command
may be issued nondestructively, With the XOR being per
XDRead
XPWrite
XDWriteExtended
formed but the neW blocks not being Written to disk.
XDRead: Reads the contents of a disk’s XOR buffer into
host memory.
XPWrite: Write block(s) from host memory into a disk’s
buffer. The disk performs an XOR betWeen the old contents
of disk storage and the buffer, and commits the results of the
XOR to disk storage.
XDWriteExtended: Write block(s) from host memory to
disk storage. The disk performs an XOR betWeen the old and
neW contents of disk storage before committing the Write.
XDWriteExtended, non-destructive
RB — Rebuild
RG — Regenerate
XOR - Host XOR
45
The ordering of commands is shoWn for convenience. In
actual operation, commands may be reordered to take advan
tage of asynchronous disk access, but operations involving
the same disk Will typically be in the order listed.
The parity arrangement for the folloWing examples is
The disk also initiates an XPWrite command to another disk.
Regenerate: The host sends the disk a list of blocks on one
or more other disks to be used in conjunction With block(s)
on the target disk. The target disk initiates reads to the other
given in Table 21:
TABLE 21
disks, and performs an XOR operation betWeen the blocks
on all disks in the list, including itself. All transfers have the
same number of blocks. The results of the XOR operations
are retained in the XOR buffer to be obtained using an
XDRead command. This command may also be issued With
blocks from the host to be used in an additional XOR
operation.
Rebuild: The host sends the disk a list of blocks on one or
more other disks to be Written to a location on the target disk.
The target disk initiates reads to the other disks, performs an
XOR operation betWeen all the blocks, and commits the
results to disk storage. All transfers must have the same
number of blocks. This command may also be issued With
blocks from the host to be used in an additional XOR
operation.
Drive
55
1
2
3
4
5
6
A2
B5
P3
D1
EA
P6
A3
B6
PC
D2
E1
PF
A4
P2
C1
D6
P5
F3
A5
PB
C2
D3
PE
F4
P1
B3
C6
P4
E2
F5
PA
B4
C5
PD
E6
F1
The examples are arranged according to the folloWing
outline:
A. Sample Code for Non-XOR implementation:
1. Non-degraded mode (no drive failures)
2. Degraded mode (single drive failure)
US 6,353,895 B1
15
16
// EXample: Write A2—A4 (3/4 of a stripe)
3. Double degraded mode, tWo-drive failure
4. Rebuilding mode
B. Sample Code for XOR implementation
RD( 1, a2) (2, a3) (3, a4) (6, pA) RD( 1, p3) (3, p2) (5,
p4)
aaeXOR(A2, a2)
bbeXOR(A3, a3)
cceXOR(A4, a4)
1. Non-degraded mode (no drive failures)
2. Degraded mode (single drive failure)
3. Double degraded mode
4. Rebuilding mode
C. Sample Code for XOR implementation With special
PAeXOR (aa, bb, cc, pA)
XOR commands
1. Non-degraded mode (no drive failures)
2. Degraded mode (single drive failure)
1O
3. Double degraded mode
4. Rebuilding mode
A. Sample Code for Non-XOR implementation
1. Non-degraded mode (no drive failures)
P4+XOR (cc, p4)
WR(1, P3) (3, P2 ) (5, P4)
15
Read data. The array is fully striped, as in RAID 5.
RD(1,a2) \\Reads one stripe unit
RD(1,b5)RD(2,b6) \\Reads tWo stripe units
Perform an Update Write (small Writes) using the folloW
ing steps: (1) Read stripe unit to be Written; (2) read old roW
and column parities; (3) Write out neW stripe unit; (4) XOR
With old/neW data and parities to calculate update XOR; and
(5) reWrite parities.
Perform a Very Large Write (includes complete parity
arrangement) using the folloWing steps: (1) Write out all
//EXample: Write A2
RD(1, a2) (3, p2) (6, pA)
data; (2) Perform XOR on all roWs to calculate parity A—F;
(3) Write to disk; (4) Perform XOR on all column sets 1—6;
and (5) Write to disk.
PAeXOR(A2, A3, A4, A5)
PBeXOR(B3, B4, B5, B6)
PCeXOR(C1, C2, C6, C5)
PDeXOR(D1, D2, D6, D3)
PEeXOR(E2, E6, E4, E1)
PFeXOR(F3, F4, F5, F1)
P1eXOR(C1, D1, E1, E1)
WR(1, A2) (2, A3) (3, A4) ( 6, PA)
P2eXOR(aa, p2)
P3eXOR(bb, p3)
aaeXOR (A2,a2)
PAeXOR (aa, pA)
25
P2eXOR(aa, p2)
WR(1, A2) (3, P2) (6, PA)
2. Degraded mode (single drive failure)
Perform a Read from failed drive using the folloWing
steps: (1) Read all other members of the roW parity set
(including the roW parity); and (2) Use XOR to regenerate
the missing stripe unit. The column parity set could be used,
but the roW stripe is more likely to take advantage of caching
or of reads larger than the failed stripe unit.
// EXample: Drive 1 failed, read A2
35
RD( 2, a3) (3, a4) (4, a5) (6, pA)
a2eXOR (a3, a4, a5, pA)
Perform a Write to failed drive using the folloWing steps:
WR(1, A2, B5, P3, D1, E4, P6)
WR(2, A3, B6, PC, D2,E1, PF)
WR(3, A4, P2,C1, D6, P5, F3)
WR(4, A5, PB,C2, D3, PE, F4)
WR(5, P1, B3, C6, P4, E2, E5)
WR(6, PA, B4,C5, PD, E6, E1)
Perform a Stripe Write (includes complete roW) using the
folloWing steps: (1) Read all data stripe units from roW; (2)
(1) Regenerate striped unit as above; and (2) folloW update
Write procedures.
// EXample: Drive 1 failed, Write A2
45
Perform XOR on all neW data in roW; (3) Write to roW parity;
(4) Write out neW data stripe units; (5) Read in column
To perform a Write to drive With failed parity unit,
perform a Write as normal but ignore the failed parity unit.
parities; (6) perform update XOR; and (7) reWrite column
parity.
// EXample: Drive 6 failed, Write A2
// EXample—Write A2—A5.
RD(1,a2) (2,a3) (3,a4) (4,a5)
RD(I, P3) (3, P2, P5) (5, P4)
PAeXOR(A2, A3, A4, A5)
WR (1, A2) (2, A3) (3, A4) (4, A5)(6, PA)
P2eXOR(a2, A2, p2)
P3eXOR(a3, A3, p3)
P4eXOR(a4, A4, p4)
P5eXOR(a5, A5, p5)
WR (1, P3) (3, P2, P5) (5, P4)
55
out neW stripe units; (4) XOR With old/neW data and parities
to calculate update XOR; and (5) reWrite all parities.
RD(1, a2) (3, p2)
P2eXOR(a2, A2, p2)
WR(1, A2) (3, P2)
3. Double degraded mode, tWo-drive failure
Perform a Write to failed drive With failed parity drive
using the folloWing steps: (1) Read the remaining parity set;
(2) Recalculate the parity set from available stripe units; and
(3) Write out the parity unit ignoring the failed parity drive.
// EXample: Drives 1 and 6 failed, Write to A2
RD (2, d2) (4, c2) (5, e2)
Perform a Partial Stripe Write (not a complete roW) using
the folloWing, steps: (1) Read all stripe units to be overWrit
ten; (2) Read stripe parity and all column parities; (3) Write
RD( 2, a3) (3, a4, p2) (4, a5) (6, pA)
a2eXOR(a3, a4, a5, pA)
aaeXOR(a2, A2)
PA<XOR(aa, pA)
P2eXOR(aa, p2)
WR(3, P2) (6, PA)
WR(3, P2)
65
To Write to an intact drive With both parity drives gone,
perform a normal Write.
// EXample: Drives 3 and 6 failed, Write A2
US 6,353,895 B1
17
18
WR(1, A2)
To Read from a failed drive With one parity drive doWn,
TABLE 22
reconstruct any stripe units needed for remaining parity set.
// Example one: Drives 1 & 6 failed, read A2
// In this example, all other stripe units in the parity set are
available
RD (2, d2) (3, p2) (4, c2),(5, e2)
A2eXOR(d2, c2, e2, p2)
// Example tWo: Drives 1 and 3 failed, read A2
// In this example the ‘A’ set must be used to reconstruct.
The Worst case Would be repeated requests for A3 With
// HoWever, A4 is not available (it is on drive 3).
// A4 must be reconstructed from the ‘4’ set, Which
requires E4 (it is on drive 1)
15
// E4 must be reconstructed from the E set, Which is
otherWise intact
culations Without requiring repeated reads.
4. Rebuilding mode
In rebuilding mode, data is regenerated as if a drive has
failed, and then Written directly to the drive. Parity units on
the rebuilding drive are generated by XOR of all its set
members.
RD(4, f4) (5, p4) (6, b4)
B. Sample Code for XOR implementation
RD (2, a3) (4, a5) (6, pA)
25
To Read from failed drive With other failed drive having
1. Non-degraded mode (no drive failures)
// Example: Drives 1 & 5 failed, read A2
To Read (any siZe), use non-XOR implementation above.
// Drive 5 has set 2 member E2, but no members ofA
To perform a Very Large Write (including complete parity
RD( 2, a3) (3, a4) (4, a5) (6, pA)
a2eXOR(a3, a4, a5, pA)
To Read from failed drive With other failed drive having
a member of both its parity sets, regenerate a missing parity
35
// Example: Drives 1 & 2 failed, read A2
// Drive 2 has ‘A’ member A3, and ‘2’ member D2.
// The 3 set cannot be regenerated (‘3’ parity unit is on
arrangement), use the non-XOR implementation above.
To perform a Full Stripe Write (a complete roW) use the
folloWing steps: (1) Use the data to calculate the neW parity
roW; (2) Use XDWrite commands to Write the data to disk;
(3) use XDReads to get the XOR data from the XOR buffers;
(4) Write out the parity unit for the roW; and (5) use
XPWrites to update the parity units for the column parities.
// Example: Write A2—A5 (full stripe)
XW(1, A2) (2, A3) (3, A4) (4, A5)
drive 1),
PAeXOR(A2, A3, A4, A5)
// so D2 must be regenerated.
// The D roW is also missing D1, so D1 must be regen
erated from parity set 1.
// E1 is on drive 2, it must be regenerated from parity set
This section illustrates hoW the XOR commands
(excluding special XOR commands) can be used to expedite
the XOR parity implementation.
a member of one parity set, use the other set to regenerate.
set member to regenerate read.
tiple regenerations are required, the implementation could
read the entire extent, and then perform all necessary cal
RD (2, e 1) (4, pE) (5, e2) (6, e6)
e4eXOR(e1, e2, e6, pE)
a2eXOR (a3, a4, a5, pA)
drives 1 and 2 out, but most accesses have much more
modest reconstruction requirements. In a case Where mul
45
E.
WR(6, A5)
aaQXP(3, P2)
bbQXP(1, P3)
ccQXP(5, P4)
// E4 is on drive 1, it must be regenerated from parity set
4, Which is intact.
RD (3, a4) (4, f4) (5, p4) (6, b4)
RD (4, pE)(5, e2)(6, e6)
eleXOR (e2, e4, e6, pE)
RD(3, c1) (5, pl) (6, f1)
d1eXOR(c1, e 1, f 1, p1)
55
To perform a Partial Stripe Write (not a complete roW) use
the folloWing steps: (1) Use XDWrites to Write the neW data
to the disks; (2) use XDReads to get the contents of the XOR
buffer; (3) use XPWrite to the appropriate column parities;
RD(3, d6)(4, d3)(6, pD)
(4) XOR all of the XOR buffers to get a buffer for the roW
d2eXOR(d 1, d3, d6, pD)
RD (3, p2) (4, c2) (5, e2)
a2eXOR(c2, d2, e2, p2)
parity; and (5) XPWrite that buffer to the roW parity.
// Example: Write A2—A4 (3/4 of a stripe)
The depth of the multiple levels of regeneration
(resolvable dependencies) depends, to some extent, on the
arrangement. Table 22 shoWs the regeneration requirements
of a tWo-drive failure, With the ?rst failed drive being
number 1, the second drive listed in the horiZontal axis, and
the number of regenerations required in the vertical axis.
The number of data stripe units in the array is 24.
65
ddeXOR (aa, bb, cc)
eaaQXP (3, P2)
bbQXP( 1, P3)
US 6,353,895 B1
19
20
2. Degraded mode (single drive failure)
ccQXP(5, P4)
ddQXP(6, PA)
To perform a Partial Stripe Write (not a complete row) use the following steps: (1) Use XDWrites to write the new data to the disks; (2) use XDReads to get the contents of the XOR buffer; (3) use XPWrite to the appropriate column parities; (4) XOR all of the XOR buffers to get a buffer for the row parity; and (5) XPWrite that buffer to the row parity.

// Example: Write A2-A4 (3/4 of a stripe)
aa→XP(3, P2)
bb→XP(1, P3)
cc→XP(5, P4)
dd←XOR(aa, bb, cc)
dd→XP(6, PA)
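A small sketch of step (4), again under the assumed buffer semantics (unit names and block sizes are arbitrary): the XOR of the per-drive buffers is exactly the delta that brings the old row parity up to date, so the row parity never has to be recomputed from the full row.

import os

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

# Old row: four data units plus their row parity.
old = {u: os.urandom(8) for u in ("A2", "A3", "A4", "A5")}
pA_old = xor(xor(old["A2"], old["A3"]), xor(old["A4"], old["A5"]))

# A partial stripe write replaces A2-A4 only.
new = {u: os.urandom(8) for u in ("A2", "A3", "A4")}

# Each XDWrite buffer holds old XOR new for its unit (aa, bb, cc in the example).
aa, bb, cc = (xor(old[u], new[u]) for u in ("A2", "A3", "A4"))

# dd is the XOR of all the buffers; XPWriting it updates the row parity.
dd = xor(xor(aa, bb), cc)
pA_new = xor(pA_old, dd)

# Same result as recomputing the row parity from the new row contents.
assert pA_new == xor(xor(new["A2"], new["A3"]), xor(new["A4"], old["A5"]))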
To perform an Update Write (small writes) use the following steps: (1) Use XDWrite to write the new data to the block; (2) Use XDRead to get the results of the XOR buffer; and (3) Use XPWrite to write the update information to both parities.

// Example: Write A2
XW(1, A2)
aa→XP(3, P2) (6, PA)
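The same old-XOR-new delta drives the small-write case; the sketch below (illustrative names, not the patent's code) shows why a single buffer can update both the row parity and the column parity.

import os

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

# One data unit (A2), the XOR of the rest of its row, and the XOR of the rest of its column.
a2_old, rest_of_row, rest_of_col = os.urandom(8), os.urandom(8), os.urandom(8)
pA = xor(a2_old, rest_of_row)   # row parity
p2 = xor(a2_old, rest_of_col)   # column parity

# Small write of A2: after XDWrite/XDRead the buffer holds old XOR new.
a2_new = os.urandom(8)
aa = xor(a2_old, a2_new)

# XPWrite folds the same buffer into both parities.
pA_new, p2_new = xor(pA, aa), xor(p2, aa)
assert pA_new == xor(a2_new, rest_of_row) and p2_new == xor(a2_new, rest_of_col)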
2. Degraded mode (single drive failure)

To Read from a failed drive, use the non-XOR procedure above.

To Write to a failed drive, regenerate the stripe unit, then follow the update write procedures.

// Example: Drive 1 failed, Write A2
RD(2, a3) (3, a4) (4, a5) (6, pA)
a2←XOR(a3, a4, a5, pA)
aa←XOR(a2, A2)
aa→XP(3, P2) (6, PA)

To Write to a drive with a failed parity unit, ignore the failed parity unit.

// Example: Drive 6 failed, Write A2
XW(1, A2)
aa→XP(3, P2)

3. Double degraded mode

Without third party commands, these procedures will all be performed in much the same way as the non-XOR procedure above.

4. Rebuilding mode

To rebuild, use the non-XOR procedures above.
C. Sample Code for XOR with third party commands

1. Non-degraded mode (no drive failures)

To Read (any size), use the same procedure as the non-XOR commands above.

To perform a Very Large Write (including a complete parity arrangement), use the non-XOR commands above.

To perform a Stripe Write (including a complete row) use the following steps: (1) Perform an XOR on all new data blocks in the row; (2) Write to the stripe parity; and (3) Write out the new blocks using XDWriteExtended to update the column parity.

// Example: Write A2-A5
XX(1, A2, [3, P2]) (2, A3, [1, P3]) (3, A4, [5, P4]) (4, A5, [3, P5])
WR(6, PA)

To perform a Partial Stripe Write (not a complete row) use the following steps: (1) Read all row stripe units not overwritten; (2) Calculate the row parity with the old and new stripe units; (3) Write to the row parity; and (4) Write out the new stripe units using XDWriteExtended to update the column parity.

// Example: Write A2-A4 (3/4 of a stripe)
RD(4, a5)
PA←XOR(A2, A3, A4, a5)
XX(1, A2, [3, P2]) (2, A3, [1, P3]) (3, A4, [5, P4])
WR(6, PA)

To Update Write (small writes), use the XOR procedure above.
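As an illustration of how the XX (XDWriteExtended) and XP (XPWrite) commands cooperate, the sketch below models a drive as a plain Python object; the class, method names and delta-forwarding behaviour are assumptions made for illustration, not the commands' actual definitions.

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

class Drive:
    def __init__(self, blocks=None):
        self.blocks = dict(blocks or {})   # unit name -> bytes

    def xdwrite_extended(self, unit, new_data, parity_drive, parity_unit):
        # Store the new data, compute old XOR new locally, and push the delta
        # to the named parity drive (the "third party" transfer).
        delta = xor(self.blocks[unit], new_data)
        self.blocks[unit] = new_data
        parity_drive.xpwrite(parity_unit, delta)

    def xpwrite(self, unit, delta):
        # Fold a delta into an existing (parity) unit.
        self.blocks[unit] = xor(self.blocks[unit], delta)

# XX(1, A2, [3, P2]) would then correspond to:
#   drive1.xdwrite_extended("A2", new_a2, drive3, "P2")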
2. Degraded mode (single drive failure)

To Read from a failed drive, send a regenerate command to the parity drive for the desired stripe.

// Example: Drive 1 failed, read A2
RG(6, pA, [2, a3], [3, a4], [4, a5])

To Write to a failed drive, regenerate the block as above, then follow the update write procedures.

// Example: Drive 1 failed, Write A2
RG(6, pA, [2, a3], [3, a4], [4, a5])
aa←XOR(a2, A2)
aa→XP(3, P2)
aa→XP(6, PA)

To Write to a failed parity drive, ignore the failed parity unit.

// Example: Drive 6 failed, Write to A2
XX(1, A2, [3, P2])
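The regenerate (RG) command used above can be modelled as below; the function name and calling convention are illustrative assumptions, but the arithmetic is the same XOR used throughout: the parity unit folded with every surviving member yields the missing member.

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def regenerate(parity_block: bytes, surviving_members: list) -> bytes:
    # Model of RG: the parity drive XORs its parity unit with the member
    # blocks supplied by the surviving drives, leaving the lost member in
    # its XOR buffer for the host (or a follow-on command) to read.
    buf = parity_block
    for member in surviving_members:
        buf = xor(buf, member)
    return buf

# RG(6, pA, [2, a3], [3, a4], [4, a5]) after drive 1 fails:
#   a2 = regenerate(pA, [a3, a4, a5])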
3. Double degraded mode

To Write to a failed drive with a failed parity drive: if the remaining parity set has no members on the failed parity drive, use the rebuild procedure above; otherwise, regenerate enough old data to perform an update operation on the remaining parity drive.

// Example one: Drives 1 and 6 failed, Write to A2
A2→RB(3, P2, [2, d2], [4, c2], [5, e2])

// Example two: Drives 1 & 3 failed, Write A2
// Regenerate E4
RG(4, pE, [2, e1], [5, e2], [6, e6])
// Regenerate A4
e4→RG(5, P4, [4, f4], [6, b4])
// Update operation
aa←XOR(A2, a4)
aa→RB(6, PA, [2, a3], [4, a5])

To Write to an intact drive with both parity drives gone, simply perform the write.

// Example: Drives 3 & 6 failed, Write A2
WR(1, A2)

To Read from a failed drive with one parity drive failed, reconstruct any stripe units needed for the remaining parity set.

// Example one: Drives 1 & 6 failed, read A2
// In this example, all other stripe units in the parity set are available.
RG(3, p2, [2, d2], [4, c2], [5, e2])

// Example two: Drives 1 & 3 failed, read A2
// In this example the 'A' set must be used to reconstruct.
// However, A4 is not available (it's on drive 3).
// A4 must be reconstructed from the '4' set, which requires E4 (it's on drive 1).
// E4 must be reconstructed from the E set, which is otherwise intact.
// Regenerate E4
RG(4, pE, [2, e1], [5, e2], [6, e6])
e4←XR(4)
// Regenerate A4
e4→RG(5, P4, [4, f4], [6, b4])
// Regenerate A2
a4→RG(6, pA, [2, a3], [4, a5])

To Read from a failed drive with the other failed drive having a member of one parity set, use the other set to regenerate.

// Example: Drives 1 & 5 failed, read A2
// Drive 5 has set 2 member E2, but no members of the A set.
RG(6, pA, [2, a3], [3, a4], [4, a5])

To Read from a failed drive with the other failed drive having a member of both its parity sets, regenerate a missing parity set member to regenerate the read.

// Example: Drives 1 & 2 failed, read A2
// Drive 2 has 'A' member A3, and '2' member D2.
// The 3 set cannot be regenerated ('3' parity unit is on drive 1),
// so D2 must be regenerated.
// The D row is also missing D1, so D1 must be regenerated from parity set 1.
// E1 is on drive 2; it can be regenerated from parity set E.
// E4 is on drive 1; it can be regenerated from parity set 4, which is intact.

In some situations, it is more efficient (i.e., faster) to read the entire stripe map and manually reconstruct the data, at least for a stripe unit with so many dependencies.
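The read chain in example two can be expressed directly as XOR arithmetic; the sketch below (an illustration with an assumed helper and a dictionary of surviving units, not the patent's code) mirrors the three regenerations in order.

def xor_blocks(*blocks: bytes) -> bytes:
    out = bytearray(len(blocks[0]))
    for blk in blocks:
        for i, b in enumerate(blk):
            out[i] ^= b
    return bytes(out)

def read_a2_after_drives_1_and_3_fail(unit: dict) -> bytes:
    # E4 from the otherwise intact E row.
    e4 = xor_blocks(unit["e1"], unit["e2"], unit["e6"], unit["pE"])
    # A4 from the '4' column set, now that E4 is known.
    a4 = xor_blocks(e4, unit["f4"], unit["b4"], unit["p4"])
    # A2 from the A row, now that A4 is known.
    return xor_blocks(unit["a3"], a4, unit["a5"], unit["pA"])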
4. Rebuilding mode

The rebuilding command can expedite rebuilding, especially when recovering from a single drive failure. The rebuild command is used to reconstruct a stripe unit and commit it to disk.

// Example: Rebuild A2 to disk.
RB(1, A2, [2, a3], [3, a4], [4, a5], [6, pA])
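A whole-drive rebuild simply repeats that per-unit reconstruction for every stripe unit assigned to the replaced drive; the sketch below is a minimal illustration (the mapping of units to set members is supplied by the caller, and the function and parameter names are assumptions).

def xor_blocks(*blocks: bytes) -> bytes:
    out = bytearray(len(blocks[0]))
    for blk in blocks:
        for i, b in enumerate(blk):
            out[i] ^= b
    return bytes(out)

def rebuild_drive(lost_units, set_members, read_unit):
    # lost_units: names of the stripe units on the replaced drive.
    # set_members: unit name -> names of the other members of one intact parity set.
    # read_unit: callable returning the bytes of a surviving unit.
    rebuilt = {}
    for name in lost_units:
        rebuilt[name] = xor_blocks(*(read_unit(m) for m in set_members[name]))
    return rebuilt   # committed back to the replacement drive, unit by unit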
Although the foregoing invention has been described in some detail for purposes of clarity of understanding, it will be apparent that certain changes and modifications may be practiced by persons skilled in the art within the scope of the appended claims. Accordingly, the present embodiments are to be considered as illustrative and not restrictive, and the invention is not to be limited to the details given herein, but may be modified within the scope and equivalents of the appended claims.
What is claimed is:

1. A redundant array of independent disk drives wherein a two-dimensional XOR parity arrangement provides two-drive fault tolerance, comprising:
a disk array comprising a first disk drive and a plurality of disk drives; and
an array controller operatively coupled to the disk array, said array controller is configured to calculate row XOR parity sets and column XOR parity sets, said array controller is further configured to distribute said row XOR parity sets and said column XOR parity sets across said disk array by arranging said row and column XOR parity sets such that a first data block on said first disk drive exists in a first row parity set and a first column parity set, and wherein no other data block on any of said plurality of disk drives exists in both said first row parity set and said first column parity set.

2. The redundant array of independent disk drives of claim 1, wherein said first data block is a stripe unit.

3. The redundant array of independent disk drives of claim 1, wherein said array controller is configured to recover data lost due to a failure of any one disk drive in said disk array.

4. The redundant array of independent disk drives of claim 1, wherein said controller is configured to recover data lost due to a failure of any two disk drives in said disk array.

5. The redundant array of independent disk drives of claim 1, wherein said controller is further configured to reduce reconstruction interdependencies between said first disk block and a second disk block.

6. A method for providing two-drive fault tolerance in a redundant array of independent disk drives, comprising the steps of:
organizing a plurality of disk drives into a plurality of stripes, each of the plurality of stripes comprises a plurality of stripe units, wherein each stripe unit is located on a single disk drive; and
arranging said stripe units into a plurality of XOR parity sets, each of said plurality of XOR parity sets comprises a plurality of stripe units as members, said plurality of XOR parity sets comprises a plurality of row parity sets and a plurality of column parity sets such that each stripe unit exists in a parity set pair, said parity set pair comprising a row parity set and a column parity set, and wherein no two stripe units exist in the same parity set pair.

7. The method of claim 6, wherein each of said XOR parity sets comprises one or more data members and one or more parity members, said parity members calculated as an exclusive-or of said data members.

8. The method of claim 7, wherein each of said disk drives comprises data members and parity members.

9. The method of claim 6, wherein each of said parity members exists in only one XOR parity set.

10. The method of claim 6, further comprising the step of analyzing said arrangement to detect cyclic dependencies.

11. A redundant array of independent disk drives configured to recover data lost due to a failure of any two disk drives in the array, comprising:
a disk array comprising a first disk drive and a plurality of disk drives; and
an array controller operatively coupled to said disk array, said array controller configured to calculate row XOR parity sets and column XOR parity sets, said array controller is further configured to distribute said row XOR parity sets and column XOR parity sets across said disk array to reduce reconstruction interdependencies for data reconstructed from said row parity sets and said column parity sets, said interdependencies being reduced by arranging said row and column parity sets such that a first stripe on said first disk drive exists in a first row parity set and a first column parity set, and wherein no other stripe on any of said plurality of disk drives exists in both said first row parity set and said first column parity set.
*     *     *     *     *