US006353895B1

(12) United States Patent    Stephenson
(10) Patent No.: US 6,353,895 B1
(45) Date of Patent: Mar. 5, 2002

(54) RAID ARCHITECTURE WITH TWO-DRIVE FAULT TOLERANCE
(75) Inventor: Dale J. Stephenson, Tracy, CA (US)
(73) Assignee: Adaptec, Inc., Milpitas, CA (US)
( * ) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 0 days.
(21) Appl. No.: 09/250,657
(22) Filed: Feb. 16, 1999

Related U.S. Application Data
(60) Provisional application No. 60/075,273, filed on Feb. 19, 1998.

(51) Int. Cl.7: G06F 11/00; G06F 17/30; G11B 5/00
(52) U.S. Cl.: 714/5; 714/6; 711/114
(58) Field of Search: 714/6, 7, 767, 770, 800, 5; 711/114, 100, 111; 707/202; 709/214

(56) References Cited
U.S. PATENT DOCUMENTS
5,446,855 A * 8/1995
5,774,641 A * 6/1998
6,138,125 A * 10/2000 DeMoss
6,219,800 B1 * 4/2001 Johnson et al.
6,223,323 B1 * 4/2001 Wescott

OTHER PUBLICATIONS
M. Blaum, J. Brady, J. Bruck, and J. Menon, "EVENODD: An Efficient Scheme for Tolerating Double Disk Failures in RAID Architectures," IEEE Transactions on Computers, vol. 44, no. 2.

* cited by examiner

Primary Examiner: Gopal C. Ray
(74) Attorney, Agent, or Firm: Martine & Penilla, LLP

(57) ABSTRACT
A two-dimensional parity arrangement that provides two-drive fault tolerance in a RAID system is presented. The parity arrangement uses simple exclusive-or (XOR) parity codes rather than the more complex Reed-Solomon codes used in a conventional RAID 6 implementation. User data on the physical disk drives in the RAID system is arranged into XOR row parity sets and XOR column parity sets. The XOR parity sets are distributed across the physical disk drives by arranging the parity sets such that the data on each physical drive exists in two separate parity sets, with no stripe unit in the same two sets. The storage lost due to parity is equal to the capacity of two drives, or 2/N the total capacity of an N-drive array. Accordingly, this parity arrangement uses less storage than mirroring when the number of total drives is greater than four.

11 Claims, 6 Drawing Sheets

[Representative drawing: FIG. 6, the flowchart of the procedure for analyzing a parity arrangement for unresolvable two-drive-failure dependencies (steps 601-612); the same steps are described in the detailed description below.]
[Drawing Sheets 1-6 of 6, U.S. Patent, Mar. 5, 2002, US 6,353,895 B1:
FIG. 1 - hardware block diagram of one or more disk drives 106-107 attached through a disk controller 104 to a computer system 102.
FIG. 2 - logical block diagram of the mapping of the physical drives to logical drives by the array controller.
FIG. 3 - logical block diagram of data striping across the member drives, with each logical block equivalent to a stripe unit.
FIG. 4 - flowchart giving an overview of the parity-arrangement design process (steps 402-406).
FIG. 5 - flowchart of the process of finding a column parity set (steps 501-514).
FIG. 6 - flowchart of the process of analyzing a parity arrangement for two-drive-failure dependencies (steps 601-612).
Sheets 4-6 reproduce, in flowchart form, the steps of FIGS. 4-6 that are walked through in the detailed description.]
I 509» IF AN UNMARKED ROW IN THE OTHER DRIVE SHARES A PARITY SET OR ROW WITH THE CURRENT UNMARKED ROW, MARK IT TRUE, MAKE IT CURRENT, AND ADD ITS STRIPE UNIT TO THE DEPENDENCY LIST, REPEAT STEP 609 AS NECESSARY I 570' INCREMENT THE NUMBER OF DEPENDENCIES, IF UNMARKED ROWS STILL EXIST, GO BACK TO 608 AND START A NEW CHAIN, I 622‘ 6'77 —\ CONTINUE IF A STRIPE UNIT HAS A DEPENDENCY VALUE OF 0x09, IT BELONGS TO DEPENDENCY CHAINS 0(1<<o) AND 3(1<<3) I 672 \ EXIT .I US 6,353,895 B1 1 2 RAID ARCHITECTURE WITH TWO-DRIVE FAULT TOLERANCE or (XOR) results of all data blocks in the parity disks roW. The Write bottleneck is reduced because parity Write opera tions are distributed across multiple disks. CROSS REFERENCE TO RELATED APPLICATIONS The RAID 6 architecture is similar to RAID 5, but RAID 6 can overcome the failure of any tWo disks by using an additional parity block for each roW (for a storage loss of The present application claims priority bene?t of US. Provisional Application No. 60/075,273, ?led Feb. 19, 1998. 2/N). The ?rst parity block (P) is calculated With XOR of the data blocks. The second parity block (Q) employs Reed BACKGROUND OF THE INVENTION 1. Field of the Invention The disclosed invention relates to architectures for arrays of disk drives, and more particularly, to disk array architec tures that provide tWo-drive fault tolerance. 2. Description of the Related Art Solomon codes. RAID 6 provides for recovery from a tWo-drive failure, but at a penalty in cost and complexity of the array controller because the Reed-Solomon codes are complex and may require signi?cant computational resources. The complexity 15 A Redundant Array of Independent Disks (RAID) is a storage technology Wherein a collection of multiple disk of Reed-Solomon codes may preclude the use of such codes in softWare and may necessitate the use of expensive special purpose hardWare. Thus, implementation of Reed-Solomon codes in a disk array increases the cost and complexity of the drives is organized into a disk array managed by a common array controller. The array controller presents the array to the array. Unlike the simpler XOR codes, Reed-Solomon codes cannot easily be distributed among dedicated XOR proces user as one or more virtual disks. Disk arrays are the sors. frameWork to Which RAID functionality is added in func tional levels to produce cost-effective, highly available, high-performance disk systems. SUMMARY OF THE INVENTION RAID level 0 is a performance-oriented striped data mapping technique. Uniformly siZed blocks of storage are 25 (rather than Reed-Solomon codes). The XOR parity stripe assigned in a regular sequence to all of the disks in the array. units are distributed across the member disks in the array by RAID 0 provides high I/O performance at loW cost. Reli separating parity stripe units from data stripe units. In one ability of a RAID 0 system is less than that of a single disk embodiment, the number of data stripe units is the same as drive because failure of any one of the drives in the array can result in a loss of data. the square of tWo less than the number of drives (i.e., (N—2 * N—2)). Each data stripe unit is a member of tWo separate parity sets, With no tWo data stripe units sharing the same RAID level 1, also called mirroring, provides simplicity and a high level of data availability. A mirrored array tWo parity sets. 
Advantageously, the storage loss to parity includes tWo or more disks Wherein each disk contains an stripe units is equal to the sum of the dimensions, so this identical image of the data. A RAID level 1 array may use parallel access for high data transfer rates When reading. 35 RAID 1 provides good data reliability and improves perfor high cost. RAID level 2 is a parallel mapping and protection tech tolerance. The array includes tWo or more disk drives and a disk controller. Data recovery from a one or tWo drive failure nique that employs error correction codes (ECC) as a correction scheme, but is considered unnecessary because off-the-shelf drives come With ECC data protection. For this a result, RAID 2 is rarely used. RAID level 3 adds redundant information in the form of parity arrangement uses less storage than mirroring When the number of total drives is greater than four. One embodiment includes a redundant array of indepen dent disk drives that provides one-drive and tWo-drive fault mance for read-intensive applications, but at a relatively reason, RAID 2 has no current practical use, and the same performance can be achieved by RAID 3 at a loWer cost. As The present invention solves these and other problems by providing tWo-drive fault tolerance using simple XOR codes is accomplished by using a tWo-dimensional XOR parity arrangement. The controller is con?gured to calculate roW XOR parity sets and column XOR parity sets, and to 45 parity data to a parallel accessed striped array, permitting distribute the parity sets across the disks drives in the array. The parity sets are arranged in the array such that no data block on any of the disk drives exists in tWo roW parity sets or tWo column parity sets. In one embodiment, the controller is con?gured to reduce reconstruction interdependencies regeneration and rebuilding of lost data in the event of a betWeen disk blocks. single-disk failure. One stripe unit of parity protects corre sponding stripe units of data on the remaining disks. RAID BRIEF DESCRIPTION OF THE DRAWINGS 3 provides high data transfer rates and high data availability. Moreover, the cost of RAID 3 is loWer than the cost of mirroring since there is less redundancy in the stored data. RAID level 4 uses parity concentrated on a single disk to 55 alloW error correction in the event of a single drive failure (as in RAID 3). Unlike RAID 3, hoWever, member disks in The advantages and features of the disclosed invention Will readily be appreciated by persons skilled in the art from the folloWing detailed description When read in conjunction With the draWings listed beloW. FIG. 1 is a hardWare block diagram shoWing attachment a RAID 4 array are independently accessible. Thus RAID 4 is more suited to transaction processing environments of one or more disk drives to a computer system. involving short ?le transfers. RAID 4 and RAID 3 both have a Write bottleneck associated With the parity disk, because shoWing mapping of one or more physical disk drives to one or more logical drives. every Write operation modi?es the parity disk. FIG. 3 is a logical block diagram shoWing data striping, Wherein each logic block is equivalent to a stripe unit. FIG. 2 is a logical block diagram of a disk array system In RAID 5, parity data is distributed across some or all of the member disks in the array. 
Thus, the RAID 5 architecture achieves performance by striping data blocks among N disks, and achieves fault-tolerance by using 1/N of its storage for parity blocks, calculated by taking the exclusive 65 FIG. 4 is a ?oWchart shoWing an overvieW of the design process. FIG. 5 is a ?oWchart shoWing the processes steps of ?nding a column parity set. US 6,353,895 B1 3 4 FIG. 6 is a ?owchart showing the processes steps of analyzing a parity set to ?nd dependencies. physical blocks 1.1, 2.1, and 3.1. A third stripe, stripe 3, comprises physical blocks 1.3, 2.3, and 3.3. Logical blocks In the drawings, the ?rst digit of any three-digit number generally indicates the number of the ?gure in which the element ?rst appears. Where four-digit reference numbers are used, the ?rst two digits indicate the ?gure number. 0—2 are mapped into stripe 1 and logical blocks 6—8 are mapped into stripe 3. In many cases a user accessing data from the logical disks will access the logical disk blocks consecutively. The stripe mapping shown in FIG. 3 maps consecutive logical blocks DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT FIG. 1 is a hardware block diagram showing attachment 10 performance because the disk operations will tend to be of one or more disk drives to a computer system. In FIG. 1, a disk controller 104 is attached to a computer system 102. One or more disk drives 106—107 are provided to the more uniformly distributed across all of the available disk drives. controller 104. Typically, the disk controller communicates with a low level software program, known as a device driver, 15 running on the computer system 102. The device driver controls the operation of the disk controller 104 and directs the controller 104 to read and write data on the disks 106—107. As is well known, there may be more than one disk controller 104 that may either be external to or part of the tem such as RAID 3 and RAID 4, the array controller 208 106—107. The present invention provides a parity arrangement FIG. 2 is a logical block diagram of a disk array system showing mapping of the disk drives 106—107 in an array 210 of the physical drives 106—107 to the logical drives 209 is provide by an array controller 208 which may be imple mented in hardware, software, or both. The array controller 208 maps the physical drives 106—107 into logical disks 204—205 such that a computer user 202 only “sees” the logical disks 204—205 rather than the physical drives 106—107. The number of physical drives 106—107, and the siZe of the physical drives 106—107 may be changed without affecting the number and siZe of the logical drives 204—205. Several physical drives 106—107 may be mapped into a single logical drive. Conversely, one of the physical drives 106—107 may be mapped into several logical drives. In addition to mapping physical drives 106—107 to logical drives 204—205, the array controller provides data striping of the data on the physical drives 25 (including the parity unit) are spread across different physi 35 Parity data is provided in an N-by-N parity map within the array 210, where N is the number of physical drives, and the storage capacity is equal to N-2 drives. One parity set includes the stripe units on a given row (row parity), while its complementary parity set is a column (column parity) 45 drawn from N-2 different rows (and also N-2 different drives). The stripe units are also distributed in such a manner that they can be striped. 
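As an illustration of the striping just described, the following minimal sketch (in C; the type and function names are assumptions, not taken from the patent) shows one way a controller could translate a logical block number into a physical drive and stripe when consecutive logical blocks are rotated across the member drives as in FIG. 3. It covers plain striping only and ignores the parity units of the N-by-N arrangement.

    /* Map a logical block to a (drive, stripe) pair for simple striping
     * across n_drives member drives, as in FIG. 3.  Drives and stripes are
     * numbered from 1 to match the 1.1, 2.1, ... labels used in the figure. */
    typedef struct {
        int drive;    /* member drive that holds the block */
        int stripe;   /* stripe (row) on that drive        */
    } phys_loc;

    static phys_loc map_logical_block(int logical_block, int n_drives)
    {
        phys_loc loc;
        loc.drive  = (logical_block % n_drives) + 1;   /* rotate across drives       */
        loc.stripe = (logical_block / n_drives) + 1;   /* next stripe every N blocks */
        return loc;
    }

With three member drives this places logical blocks 0-2 in stripe 1 and logical blocks 6-8 in stripe 3, matching the mapping described for FIG. 3.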
An example for a our drive array having four stripes per drive (four-by-four) is shown in Table 1. The data on each of the four drives is shown in columns one through four. The data in each of the four stripes is shown in rows one through four. The four-by-four arrangement result in sixteen blocks, as shown. There are of the physical disks 106—107 actually receives the data. In eight blocks of actual user data, and eight parity blocks. Each data block has a physical location (i.e., its physical location on a disk drive) and a logical position (its position in the two-dimensional parity arrangement). Each data block order to balance I/O loads across the drives, the array controller will often map consecutive logical blocks across several physical drives, as shown in FIG. 3. FIG. 3 shows an address mapping scheme known as disk 55 drives are mapped into units known as stripes. For convenience, the present disclosure treats each stripe unit as is a member of two parity sets, a row parity set and a column parity set. Letters are used to denote the row parity for a data block and numbers are used to denote column parity for a data block. Parity blocks contain no user information, but rather, only parity information. Each parity block is a having only one block, with the understanding that a stripe member of only one parity set, either a row parity set or a may contain multiple blocks. FIG. 3 shows three member column parity set. In Table 1, parity blocks are shown in drives 301—303 in a disk array. Each member drive has three parentheses. physical disk blocks (a typical real-world disk drive would labeled 3.1, 3.2, and 3.3. A ?rst stripe, stripe 1, includes cal drives. Fourth, data is available after failure of any two of the physical drives 106—107. writes data to logical block 3, the user will not know which have tens of thousands of blocks). The physical blocks on member disk one 301 are labeled 1.1, 1.2, and 1.3. The physical blocks on member disk two 302 are labeled 2.1, 2.2, and 2.3. The physical blocks on member disk three 301 are tolerance is provided using simple exclusive-or (XOR) parity processing and also using 2/N of the physical drive space for parity encoding. The two-drive XOR parity arrangement can be described in terms of four criteria as The array controller 208 maps data address on the physi striping, wherein physical address blocks having the same physical address but residing, on different physical disk whereby the array controller 208 can correct for failure of any two of the physical drives 106—107. Two-drive fault follows. First, each stripe unit in the physical drives is a member of two different parity sets. Second, different stripe units have do not have common membership in both parity sets with another stripe unit. Third, members of a parity set 106—107, and the array controller 208 corrects errors due to the failure of one or more of the physical drives 106—107. cal drives 106—107 into logical address in the logical disks 204—205. Logical addresses are typically described in terms of logical blocks, numbered consecutively from 0 to N. Typically, the user 202 does not know how logical addresses map to physical addresses. Thus, for example, if the user 202 The extent to which the array controller 208 can correct for multiple drive failures depends, in part, on the redun dancy and/or parity (i.e., error correction) data stored on the physical drives 106—107. 
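To make the four-drive example concrete, the sketch below encodes a stripe-unit layout consistent with the description of Table 1 (A2 in the first stripe unit of the second drive, row parity set A spread across drives 1, 2, and 4, column parity set 1 across drives 1, 3, and 4) and computes two of the parity units by XOR over their data members. This is an illustrative reconstruction; the block size, buffer names, and helper functions are assumptions, not the patent's controller code.

    #include <stdint.h>
    #include <string.h>

    #define UNIT 512                    /* bytes per stripe unit (assumed) */

    /* XOR the UNIT bytes of src into dst. */
    static void xor_into(uint8_t *dst, const uint8_t *src)
    {
        for (int i = 0; i < UNIT; i++)
            dst[i] ^= src[i];
    }

    /* Row parity (A) = A2 xor A3; column parity (2) = A2 xor D2. */
    static void compute_parity(const uint8_t *a2, const uint8_t *a3,
                               const uint8_t *d2,
                               uint8_t *row_a, uint8_t *col_2)
    {
        memset(row_a, 0, UNIT);
        xor_into(row_a, a2);
        xor_into(row_a, a3);

        memset(col_2, 0, UNIT);
        xor_into(col_2, a2);
        xor_into(col_2, d2);
    }

Because every member of an XOR parity set is the XOR of the other members, any single lost member can later be rebuilt the same way, for example A2 = (A) xor A3 or A2 = (2) xor D2.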
In a single dimension parity sys can correct errors due to failure of one of the physical disks computer system 102. into one or more logical disk drives 204—205. The mapping across different disk drives. Thus a user accessing logical blocks in a consecutive fashion will see improved I/O 65 For example, in Table 1, the block A2 is a data block containing user data. Physically, the block A2 resides in the ?rst stripe unit on the second drive. Logically, the block A2 is a member of the row parity set A, and is also a member of the column parity set 2. US 6,353,895 B1 6 Table 2 shoWs a ?ve drive arrangement Wherein all dependencies can be resolved. TABLE 1 Stripe Stripe Stripe Stripe 1 2 2 4 Drive 1 Drive 2 Drive 3 Drive 4 A3 (2) C1 (4) A2 (B) C4 (D) (1) B4 (3) D2 (A) B1 (C) D3 The arrangement shown in Table 1 visually ?ts the ?rst three criteria. First, each stripe unit (user data block) is a member of tWo different parity sets. Second, different stripe TABLE 2 A2 B1 (c) D4 (5) A5 (B) c3 (4) E1 (1) B3 c2 (D) E5 (A) B2 (3) D5 E4 1O With larger sets, involving more than four drives, it is possible to construct parity arrangements that satisfy the ?rst three criteria, but that have circular (unresolvable) depen dencies. A parity arrangement With circular dependencies units do not have common membership in both parity sets With another stripe unit. Thus, for example, there is only one block A2. Third, members of a parity set (including the parity unit) are spread across different physical drives. For example, the column parity set 1 is spread across drives 1, 3, and 4, the roW parity setAis spread across drives 1, 2, and A1 (2) c4 D3 (E) 15 Will have some data blocks that cannot be reconstructed after a tWo-drive failure. Consider, for example, the six drive arrangement shoWn in Table 3. 4. TABLE 3 With regards to the fourth criteria, for this simple arrangement, there are 48 different stripe-unit/drive combi drive nations to consider (eight different stripe units, six possible tWo-drive failure combinations). Forty of these can be handled by using surviving members, While eight have dependencies that require the reconstruction of another stripe unit. 25 Within an XOR parity set (either roW or column) the value of any block in the set is computed as simply the XOR 1 2 3 4 5 6 A2 B5 (3) D1 A3 B4 (C) D2 A4 (2) C1 D6 A5 (B) C2 D3 (1) B3 C6 (4) (A) BI C5 (D) E4 (6) E6 (5) F3 F1 E2 F5 E3 F4 (denoted by the symbol “GB”) of all of the other blocks in the set. Thus, for example, Table 1 shoWs a roW parity set “A” If the 4th and 6th drives failed, stripe units (blocks) D3 having members (A), A2 and A3. (Note that the block A2 is and E3 Would be unrecoverable. Neither D3 nor E3 can be reconstructed by use of the roW parity groups, since roW also a member of the column parity set “2”, and the block A3 is also a member of the column parity set “3”). The blocks A2 and A3 contain actual user data. The block (A) is the parity block. Thus, the folloWing relationships are all valid: parity units (D) and are on the failed drives. S0 D3 and E3 both Would need to be reconstructed by use of the column 35 parity set Recall that any one member of a parity set can be reconstructed from the other members of the set. If tWo members of a parity set are missing, then the set cannot be reconstructed. Both D3 and E3 are members of the same column parity set, set When the user Writes data to a disk block, the parity blocks corresponding to that disk block are recomputed. 
Thus, for example, if the user Writes neW data to the block A2, then the value of the roW parity block (A) is recomputed as (A)=A2G9A3 and stored, and the value of the column parity block (2) is recomputed as (2)=A2G9D2 and stored. With the values of (A) and (2) computed and saved as Thus reconstruction of D3 from the column parity set (3) requires that E3 be reconstructed ?rst (and vice versa). Thus, D3 and E3 cannot be recon structed. Constructing the Parity Arrangement As shoWn in Table 3 above, there are many possible 45 arrangements (schemes) of data and parity blocks. Although it is possible to construct parity sets that have circular dependencies, it is also possible to construct parity sets that have no circular dependencies. Clearly, the most desirable arrangements are arrangements that have no circular depen dencies. Even When an arrangement has no circular dependencies, there may be interdependencies (as in the case above Where above, then the value of A2 can be reconstructed if needed. If drive 2 (the drive containing A2) should fail, the value of A2 can be reconstructed from either of the folloWing rela tionships: If, however, the ?rst tWo drives (drives 1 and 2) in Table 1 fail, then both (2) and A3 are unavailable, since (2) is in a tWo-drive failure, A2 Was dependent on Interde pendencies create additional overhead When a block must be stored on drive 1 and A3 is stored on drive 2. As shoWn in 55 reconstructed. Thus, the most ef?cient parity arrangements are those arrangements that provide the loWest reconstruc the above equations, at least one of the value (2) or A3 is needed to reconstruct A2. Fortunately, A3 can be recon tion overhead (i.e., the arrangements that have the feWest structed from A3=(3) G9D3, because (3) is stored on drive 3 interdependencies). and D3 is on drive 4. Thus, A2 is dependent on A3 to survive this particular tWo-drive failure. If both drive 1 and drive 2 FIG. 4 is an overvieW ?oWchart of the identi?cation process. The process shoWn in FIG. 4 begins at a ?nd fail, A2 can be reconstructed by calculating A3=(3)G9D3 process block 402, Which includes ?nding a parity arrange folloWed by A2=(A)G9A3. ment for an N-drive array satisfying the ?rst three criteria above. The process then advances to a ?rst analysis block All of the dependencies in the four drive arrangement shoWn in Table 1 can be resolved. Thus, the failure of any tWo drives in Table 1 Will not result in the loss of data because the data in all of the blocks on the failed disks can be reconstructed. 65 404 Where the parity arrangement is analyZed to ?nd the number of unresolved dependencies. The process then advances to a second analysis block 406 Where the arrange ments Zero unresolved dependencies (found in process block US 6,353,895 B1 8 7 404) are analyzed to ?nd a parity arrangement With the lowest reconstruction overhead. In the process block 511, the process sets the parity set for the current block to Zero and advances to a process block In the ?nd process block 402, the process declares an 512. In the process block 512, the process decrements the integer matrix With siZe N><N. It is assumed that the stripe current block and advances to a process block 513. In the units should be striped, and also that each roW Will contain both a roW parity unit and a column parity unit. Furthermore, it is assumed that all stripe units in a given roW Will comprise the parity set for that roW. 
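The dependent reconstruction just traced for Table 1, rebuilding A3 from its column set and then A2 from its row set, can be sketched as follows. This is a minimal, self-contained illustration with a fixed stripe-unit size and hypothetical buffer and function names; it is not the patent's recovery code. The search procedure of FIGS. 4 and 5, to which the description returns next, is what selects arrangements so that every two-drive failure is covered by short chains like this one.

    #include <stdint.h>

    #define UNIT 512                          /* bytes per stripe unit (assumed) */

    static void xor_blocks(uint8_t *dst, const uint8_t *a, const uint8_t *b)
    {
        for (int i = 0; i < UNIT; i++)
            dst[i] = a[i] ^ b[i];
    }

    /* Drives 1 and 2 have failed.  Parity unit (3) and data unit D3 survive
     * on drives 3 and 4 and rebuild A3; then parity unit (A) on drive 4 and
     * the rebuilt A3 rebuild A2. */
    static void recover_a2(const uint8_t *p3, const uint8_t *d3,
                           const uint8_t *p_a,
                           uint8_t *a3_out, uint8_t *a2_out)
    {
        xor_blocks(a3_out, p3, d3);         /* A3 = (3) xor D3 */
        xor_blocks(a2_out, p_a, a3_out);    /* A2 = (A) xor A3 */
    }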
So the process begins by initial process block 513, if the current block is still a valid stripe unit, then the process jumps back to the process block 502; otherWise, the process advances to a process block 514. When the process reaches the process block 514, all of the possible combinations have been considered, and the pro iZing the matrix according to the folloWing pattern (example is a 6x6 array) as shoWn in Table 4. 10 cess exits. An optional de?ne can be used to insert a step 4a—If the TABLE 4 0 0 R 0 0 R 0 0 c 0 0 c 0 R 0 0 R 0 0 c 0 0 c 0 R 0 0 R 0 0 c 0 0 c 0 0 In Table 4, a value of 0 is a stripe unit not yet assigned to a column set, R represents the roW parity unit, and C represents a column parity set unit (internally, R and C are set to 0x80 and 0x40 respectively). Each C is associated With a parity set equal to its roW, the C in the ?rst roW belongs to set 1, the C in the second roW belongs to set 2, etc. If the roWs Were counted from 0, the unassigned blocks Would be set to —1 for this algorithm. 15 parity block for this parity set is in this roW, go to step 2. This is not a logical requirement, but can reduce the number of combinations considered. Another optional de?ne can be used to ?ll a number of blocks (generally the stripe units in the ?rst roW) With assigned parity sets, and terminating the program When it makes its Way back to that level. FIG. 6 is a ?oWchart shoWing the steps of analyZing the arrangement. After ?nding a parity arrangement that does not violate the ?rst three criteria the parity arrangement is analyZed for unresolvable (circular) tWo-disk failure depen dencies. The ?oWchart in FIG. 6 begins at a process block 601 Where a matrix is constructed With the parity sets not affected for each drive (every drive Will have tWo parity sets, 25 either column or roW, that do not appear on the drive). The process then advances to a loop block 602. The loop block In this example, the roW parity alWays precedes the column parity. An optional de?nition alloWs the order of R 602 provides tWo nested loops to iterate through each tWo-drive failure combination. A loop counter failil iter and C to alternate Within a parity arrangement. If an R and ates from 0 to N-2, and a loop counter faili2 iterates from faili1+1 to N-l. The ?rst process block inside the loop is C can share the same column (alWays the case With an odd number of drives), the sets they are produced from can have no stripe units in common (the program maintains a list of bad roW/column combinations to make sure the rule is not a process block 603 Where an array is constructed for each violated). The program then proceeds through the array according to 35 the ?oWchart shoWn in FIG. 5, beginning at a process block 501. In the process block 501 the process sets the ?rst stripe failed disk, With a ?eld for each roW. Each ?eld is initially set to 0 (false) to indicate that a stripe unit can be recon structed. The process then advances to a process block 604 Where, for each parity (column or roW) unit on an affected drive, the roW is set to 1 (true). The process then advances to a process unit as the current block and then advances to a process block 605, Where for each unaffected parity set (from the block 502. In the process block 502, the process increments process block 601), the process looks for a parity set member on the other drive. If the parity set member is already true, or is a parity unit, then the next parity set is checked; otherWise, the process advances to a process block 606. 
When all parity sets have been checked, the process the parity set of the current block and advances to a process block 503. In the process block 503, if the parity set exceed s the number of drives, then the process jumps forWard to a process block 511; otherWise, the process advances to a process block 504. In the process block 504, if the parity set matches any other stripe units in the roW, then the process returns to the process block 502, otherWise, the process advances to a process block 607. 45 advances to a process block 505. In the process block 505, if the parity set matches any other stripe units in the column (drive), then the process returns to the process block 502, otherWise, the process advances to a process block 506. In the process block 506, if the parity unit for this parity set is on the same column (drive), then the process returns to the process block 502; otherWise, the process advances to a process block 507. In the process block 507, it is assumed that the parity set does not con?ict, and the current block is incremented and the In the process block 606, the roW for the parity set member is marked (set) to true. The disk block just marked true also belongs to another parity set. The process block then looks for another member of that parity set on the other drive (this is folloWing a resolvable dependency). If there are none, or that member is a parity block, or that member is already true, then the process jumps back to the process block 605; otherWise the process repeats the process block 606. By the time the process reaches the process block 607, the 55 process has identi?ed all is the blocks that can be recon structed from the particular tWo-drive failure indicated by block 508, if the current block still represents a valid stripe unit, then the process jumps to a process block 510; failil and faili2. The process must noW search for blocks that cannot be reconstructed. The process ?rst checks through the roWs for any roWs that are still false. If there otherWise, the process advances to a process block 509. roWs that are false, the process advances to a process block Upon reaching the process block 509, the process has identi?ed a complete parity arrangement. In the process block 509, the process performs the analysis shoWn in 608; otherWise, the process jumps back to the process block process advances to a process block 508. In the process 602. If the process reaches process block 608, it means that there is a block that cannot be reconstructed (process block connection With FIG. 6 and sets the current block to the value returned from the analysis function. The process then advances to the process block 510. In the process block 510, the process jumps back to the process block 502. 65 606 already provided the blocks that can be reconstructed). Thus, the dependency chains identify members of circular dependencies, Which are used to shorten the searching US 6,353,895 B1 10 By counting the number of reconstructions, it is possible to identify the best Zero-dependency schemes. process. To ?nd dependencies, the process chooses an unmarked (still false) roW to begin constructing a depen dency chain. The stripe unit represented by the chosen roW Arrangements is added to a dependency list. The process then advances to The procedures listed above in connection With FIGS. 4, a process block 609. 
5, and 6 identify many schemes With no circular or unre In the process block 609, if an unmarked roW in the other failed drive shares a parity set or roW With the current unmarked roW, the process marks the roW true, makes it current, and adds its stripe unit to the dependency list. The process loops in the process block 609 through all unmarked roWs and then advances to a process block 610. 10 In the process block 610, the process increments the number of circular or unresolvable dependencies. If unmarked roWs still eXist, the process jumps back to the process block 608 and starts a neW chain; otherWise, the process jumps back to the process block 602. Aprocess block 622 is the end block for the nested loops started in the process block 602. When the nested loops are complete, the process advances to a process block 611. When the process reaches the process block 611, all tWo drive failure combinations have been evaluated and the process has a list of blocks in a dependency for this 15 solved dependencies. In many cases, there are multiple solutions. For eXample, for a siX drive array, there are 29,568 distinct schemes that meet the desired criteria. A four-drive array is listed in Table 1 above. There are siXteen distinct four-drive schemes With the same parity unit placement and the same number of recon structions. The average number of reconstructions per failed stripe unit in a tWo-drive failure is 4/3, and the average number of reconstructions per stripe-unit in a tWo-drive failure is 2/3. There are tWo alternate four-drive parity schemes, shoWn in Tables 6 and 7, that evenly distribute column parity. Both of these schemes have 2/3 reconstructions per stripe unit in a tWo-drive failure, and 4/3 reconstructions per failed stripe unit in a tWo-drive failure. The four-drive schemes offer no capacity savings over a RAID 1 scheme, and are more arrangement. In one embodiment, the blocks are stored as a complex. HoWever, the four-drive schemes Will provide full binary tag for each stripe unit. If a stripe unit has a recovery after the loss of any tWo drives, Which RAID 1 cannot. The four drive array has 8 logical blocks. Table 8 dependency value of 09 (hexadecimal), it belongs to depen dency chains 0 (1<<0) and 3 (1<<3). A stripe unit With a tag 25 of 0 has no unresolved dependencies. If this parity arrange ment has feWer dependencies (or in a Zero-dependency case, shoWs hoW physical to logical addresses are mapped (i.e., hoW the array is stripped) in the four drive arrangements. TABLE 6 the same number) than any previous arrangement, the parity arrangement and the dependencies are saved to disk. Upon completion of the process block 611, the process advances A1 (2) c4 (D) to an eXit block 612, and eXits. The process block 612 returns a neW current block for the A2 (B) 03 (4) (1) B2 (c) D3 (A) B1 (3) D4 (1) B3 (c) D4 (A) B2 (3) D1 P 2 P 6 P 3 P 7 main iteration routine (all following stripe units are then cleared). Since the process searches for an arrangement With no dependencies, the process Will look at the highest block number in each dependency, and select the loWest of these highest blocks to return. Changing a later block in the 35 TABLE 7 A4 (2) c3 (D) arrangement Would not have the possibility of removing the dependency. For the eXample siX drive parity arrangement (D3/E3 dependency) described in connection With Table 3, the analysis routine Would return 17, the logical stripe unit A1 (B) 02 (4) number of E3. 
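A compact way to record the chain membership described above is the per-unit bit tag mentioned in the text: bit k is set when the stripe unit belongs to dependency chain k, so a tag of 0x09 marks membership in chains 0 (1<<0) and 3 (1<<3), and a tag of 0 means no unresolved dependency. The sketch below shows that bookkeeping together with the return-value rule for the analysis function; the array sizes and names are assumptions, not the patent's program.

    #include <stdint.h>

    #define MAX_UNITS 64                 /* assumed upper bound on stripe units */

    static uint32_t dep_tag[MAX_UNITS];  /* bit k set: unit is in circular chain k */

    static void mark_chain(int unit, int chain)
    {
        dep_tag[unit] |= (uint32_t)1 << chain;   /* e.g. 0x09 = chains 0 and 3 */
    }

    /* Return value for the outer search: for each circular chain take its
     * highest-numbered stripe unit and return the lowest of those; with no
     * circular chains, return one past the largest logical stripe unit
     * (stripe units assumed to be numbered from zero). */
    static int analysis_return(int n_units, int n_chains,
                               const int chain_highest[])
    {
        if (n_chains == 0)
            return n_units;
        int lowest = chain_highest[0];
        for (int c = 1; c < n_chains; c++)
            if (chain_highest[c] < lowest)
                lowest = chain_highest[c];
        return lowest;
    }

For the six-drive example with the D3/E3 circular dependency this rule yields 17, the logical stripe unit number of E3, which the outer search then uses as the point from which to vary the arrangement.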
Block numbers With Zero circular dependen cies return the largest logical stripe unit number in the arrangement, plus one. Minimizing reconstruction TABLE 8 0 P 4 P 45 The process block 606 of the above algorithm is more than a method of identifying stripe units that may be reconstructed. The process block 606 also identi?es the steps 1 P 5 P A ?ve-drive (?ve-by-?ve) parity arrangement is shoWn in Table 9. This arrangement has 2/3 reconstructions per stripe required to actually recreate the stripe units. For any given number of drives, it is desired to select an arrangement that reduces the number of reconstruction involved in a drive unit in a tWo-drive failure, and 5/3 reconstructions per failed stripe unit in a tWo-drive failure. The ?ve-drive parity arrangement is striped as shoWn in Table 10. failure. The process block 606 maintains a counter to count the number of reconstruction steps involved. Each iteration in the process block 606 represents a reconstruction of a 55 stripe unit and a reconstruction that must occur to recon struct another stripe unit. For each tWo-drive failure, the number of reconstructions necessary is described by the folloWing algorithm: hop=(number of iterations reconstruction hops in the process block 606) TABLE 9 A2 B1 (c) D4 (5) A1 (2) c4 D3 (E) 0 5 1 P A5 (B) c3 (4) E1 (1) B3 c2 (D) E5 (A) B2 (3) D5 E4 P 3 P 4 total=(running count of reconstructions necessary for par ity arrangement) (after every step 6) While (hop) total+=hop--, end While TABLE 10 65 2 P US 6,353,895 B1 P 10 P 11 12 TABLE 10-continued TABLE 15 -continued 6 11 P 7 P 12 8 P 13 P 9 14 14 P 21 28 P P 15 22 29 P P 16 23 P 30 10 17 24 P 31 11 18 P 25 32 12 19 P 26 33 13 P 20 27 34 AsiX-drive (siX-by-siX) XOR parity arrangement is shown in Table 11. Data blocks are striped as shoWn in Table 12. This parity arrangement has 0.683 reconstructions per stripe An eight drive (eight-by-eight) parity arrangement is 10 unit in a tWo-drive failure, and 2.05 reconstructions per failed stripe unit in a tWo-drive failure. The parity algorithm shoWn in Table 16. This arrangement averages 0.70 recon structions per stripe unit in a tWo-drive failure, and 2.81 reconstructions per failed stripe unit in a tWo-drive failure. found 246 schemes With the same reconstruction demands. A slightly less efficient arrangement that evenly distributes There are no Zero dependency schemes that stripe the column parity sets across all columns. HoWever, a global 15 column parity is shoWn in Table 17. This parity arrangement uses 0.72 reconstructions per stripe unit in a tWo-drive striping effect can be instituted by reordering columns on the failure, and 2.87 reconstructions per failed stripe unit in a tWo-drive failure. An eight drive system is striped as shoWn in Table 18. disk. For example, every other siX-by-siX block could have the alternate parity arrangement shoWn in Table 13 to complement the arrangement shoWn in Table 11. 
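The hop and total counting rule quoted above for process block 606 can be written out directly. In the sketch below (a straight transcription of that pseudo-code into C, with nothing else assumed), a traversal that needed hop chained rebuilds adds hop + (hop-1) + ... + 1 to the running total for the arrangement.

    /* Fold the cost of one dependency traversal into the running total.
     * 'hop' is the number of step-606 iterations (chained rebuilds) needed
     * for one stripe unit in one two-drive failure; 'total' is the running
     * count of reconstructions for the whole parity arrangement. */
    static void add_reconstructions(int hop, long *total)
    {
        while (hop)
            *total += hop--;
    }

Dividing the final total over all two-drive failures by the number of stripe units, or by the number of failed stripe units, presumably gives the per-stripe-unit and per-failed-stripe-unit averages quoted below for the four-, five-, six-, seven-, and eight-drive schemes.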
20 TABLE 11 A1 B4 (3) D2 E5 (6) A2 B5 (C) D6 E3 (F) A3 (2) C6 D4 (5) F1 TABLE 16 A4 (B) C1 D5 (E) F2 (1) B2 c5 (4) E6 F3 (A) B6 c3 (D) E1 F4 TABLE 12 A2 B6 C5 (4) E1 F7 G3 (8) 25 1 2 3 P P 6 7 P P 4 5 P 12 18 P P 13 19 P 8 14 P 20 9 15 P 21 10 P 16 22 11 P 17 23 (C) D6 E3 (F) A5 B7 (C) D8 E4 F3 (G) H1 A7 (2) C4 D3 E8 (6) G1 H5 A6 (B) C8 D2 E7 (E) G5 H3 (1) B3 C7 D6 (5) F2 G8 H4 (A) B4 C1 D7 (E) F8 G6 H2 TABLE 17 35 TABLE 13 A1 B4 A4 B8 (3) D1 E2 F5 (7) H6 30 0 A2 B5 A3 B5 C2 (D) E6 F1 G4 (H) A3 (2) (A) B6 (3) D2 A4 (B) C1 D5 C6 D4 C3 (D) (1) B2 C5 (4) E5 (6) (E) F2 (5) F1 E1 F4 E6 F3 A1 B2 C5 (4) E8 F7 G6 A2 B3 C1 (D) E7 F5 G4 A4 B1 (3) D7 E6 F8 (G) A3 B4 (C) D2 E5 F6 (7) A6 (2) C4 D1 E3 (F) G7 A8 (B) C3 D4 E2 (6) G5 (1) B8 C6 D5 (E) F4 G2 (A) B7 C8 D6 (5) F3 G1 (H) (8) H5 H1 H8 H7 H3 H2 40 TABLE 18 45 0 8 1 9 2 10 3 11 4 P 5 P P 6 P 7 16 P 17 P P 18 P 19 12 20 13 21 14 22 15 23 A seven drive (seven-by-seven) XOR parity arrangement 24 25 26 27 28 29 P P is shoWn in Table 14. The seven drive array is striped as 23 ii 3; 3; 3g 3: 2g 2; P P 42 43 44 45 46 47 shown in Table 15. This arrangement evenly stripes the parity across all columns. It requires 2/3 reconstructions per stripe unit in a tWo-drive failure, and 7/3 reconstructions per failed stripe unit in a tWo-drive failure. TABLE 14 50 Array Controller Programs In one embodiment, the tWo-dimensional XOR parity arrangements are implemented in array processor 208. 55 Operation of the array processor 208 is described in terms of “programs” Written using a disk command programming A3 A2 A1 A7 A6 (1) (A) 2% B1 (3) B7 (C) (2) C5 (B) B4 B3 language described beloW. The actual operations performed (D) E6 F5 D6 E5 F4 D5 E4 (6) D4 E3 (F) D3 (5) F1 D2 (E) F7 (4) E7 F6 by the controller 208 are implemented in hardWare, softWare, or both. The eXample programs beloW are directed 60 toWards a siX disk array using the parity arrangement shoWn (7) (G) G2 G1 G7 G6 G5 C4 C3 C2 _ _ in Table 19 and the data striping (physical to logical address mapping) shoWn in Table 20. TABLE 15 0 7 1 8 2 9 3 P TABLE 19 4 P P 5 P 6 65 A2 B5 A3 B6 A4 (2) A5 (B) (1) B3 (A) B4 US 6,353,895 B1 14 Disk control program for the tWo-drive XOR parity arrangement are provided beloW in three forms: (A) Without any XOR on the drives; (B) With XOR on the drives but TABLE 19-continued (3) D1 E4 (6) (C) D2 E1 (E) C1 D6 (5) F3 C2 D3 (E) F4 C6 (4) E2 F5 C5 (D) E6 F1 Without third party commands (e.g., XDWrite Extended, Regenerate, Rebuild); and, ?nally, (C) With XOR and third party commands. The programs beloW are described using the folloWing syntax: Code(Drive, Stripe Units, [Drive, Stripe Unit]) TABLE 20 0 6 P 12 18 P 1 7 P 13 19 P 2 P 8 14 P 20 3 P 9 15 P 21 10 P 4 10 P 16 22 P 5 11 P 17 23 parentheses represent a host command, and the square brackets (if present) represent a secondary command asso ciated With the XDWriteExtended, Rebuild, or Regenerate commands. Host buffers are represented as double loWer 15 In one embodiment, the tWo-set XOR parity arrangement is implemented in a disk array comprising a software disk-driver and six SeagateTM BarracudaTM FibrechannelTM drives. In addition to the normal read/Write primitives, the FibrechannelTM drives support a class of special XOR commands, Which perform XOR calculations in the disk electronics, maintain a separate XOR result buffer, and can interact directly With other drives. The special XOR com mands help facilitate RAID 5 operations. 
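As a point of reference for the drive-assisted commands, the purely host-side form of a small update write, which is the approach taken in the non-XOR sample programs later in this section, reads the old data and both old parities, folds the old and new data into each parity, and writes everything back. A minimal sketch, with assumed names and a fixed stripe-unit size:

    #include <stdint.h>

    #define UNIT 512                      /* bytes per stripe unit (assumed) */

    /* Host-side update write of one data unit: both its row parity and its
     * column parity are patched with (old xor new) before the new data and
     * the updated parities are written back to their drives. */
    static void update_parities(const uint8_t *old_data, const uint8_t *new_data,
                                uint8_t *row_parity, uint8_t *col_parity)
    {
        for (int i = 0; i < UNIT; i++) {
            uint8_t delta = old_data[i] ^ new_data[i];
            row_parity[i] ^= delta;
            col_parity[i] ^= delta;
        }
    }

The XDWrite, XDRead, and XPWrite commands described next move exactly this old-xor-new calculation into the drives, so the host no longer has to read the old data back itself.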
In the present application, the special XOR is commands are used to provide tWo-drive fault tolerance. Conventional disk drives Where Code is a command code listed in Table 20. The case letters (e.g., aa, bb, cc, etc), and the buffer in a disk drive is indicated by #drive# Where drive is an integer. Host XOR commands have only buffers in their list. Comments are indicated by a double slash “//”. The symbols “Q” and “e” denote data transfer. Multiple operations may be coded using one command code. For example, the commands to Write A1 and A2 to drive 1 and Write A3 to drive 2 can be coded as WR(1, A1, A2)WR(2, A3) or WR(1,A1, A2)(2,A3). 25 provide the folloWing primitive operations: NeW data in a data stripe unit is indicated by an uppercase initial letter, While old data is indicated With a loWercase initial letter. Parity units are in bold, and preceded With a P or p. For the purposes of the present description, and for convenience, operations are described in terms of Whole stripe units. HoWever, operations on less than Whole stripe Read: Read block(s) from disk storage into host memory. Write: Write block(s) from host memory into disk storage. units are Within the scope of the present disclosure. The special XOR commands provide the folloWing addi TABLE 20 tional primitive operations: Command Codes XDWrite: Write block(s) from host memory into disk storage. The disk performs an XOR betWeen the old and neW contents of disk storage before committing the Write. The Read Write 35 XDWrite, destructive XDWrite, non-destructive results of the XOR are retained in an XOR buffer, and are obtained by using an XDRead command. This command may be issued nondestructively, With the XOR being per XDRead XPWrite XDWriteExtended formed but the neW blocks not being Written to disk. XDRead: Reads the contents of a disk’s XOR buffer into host memory. XPWrite: Write block(s) from host memory into a disk’s buffer. The disk performs an XOR betWeen the old contents of disk storage and the buffer, and commits the results of the XOR to disk storage. XDWriteExtended: Write block(s) from host memory to disk storage. The disk performs an XOR betWeen the old and neW contents of disk storage before committing the Write. XDWriteExtended, non-destructive RB — Rebuild RG — Regenerate XOR - Host XOR 45 The ordering of commands is shoWn for convenience. In actual operation, commands may be reordered to take advan tage of asynchronous disk access, but operations involving the same disk Will typically be in the order listed. The parity arrangement for the folloWing examples is The disk also initiates an XPWrite command to another disk. Regenerate: The host sends the disk a list of blocks on one or more other disks to be used in conjunction With block(s) on the target disk. The target disk initiates reads to the other given in Table 21: TABLE 21 disks, and performs an XOR operation betWeen the blocks on all disks in the list, including itself. All transfers have the same number of blocks. The results of the XOR operations are retained in the XOR buffer to be obtained using an XDRead command. This command may also be issued With blocks from the host to be used in an additional XOR operation. Rebuild: The host sends the disk a list of blocks on one or more other disks to be Written to a location on the target disk. The target disk initiates reads to the other disks, performs an XOR operation betWeen all the blocks, and commits the results to disk storage. All transfers must have the same number of blocks. 
This command may also be issued With blocks from the host to be used in an additional XOR operation. Drive 55 1 2 3 4 5 6 A2 B5 P3 D1 EA P6 A3 B6 PC D2 E1 PF A4 P2 C1 D6 P5 F3 A5 PB C2 D3 PE F4 P1 B3 C6 P4 E2 F5 PA B4 C5 PD E6 F1 The examples are arranged according to the folloWing outline: A. Sample Code for Non-XOR implementation: 1. Non-degraded mode (no drive failures) 2. Degraded mode (single drive failure) US 6,353,895 B1 15 16 // EXample: Write A2—A4 (3/4 of a stripe) 3. Double degraded mode, tWo-drive failure 4. Rebuilding mode B. Sample Code for XOR implementation RD( 1, a2) (2, a3) (3, a4) (6, pA) RD( 1, p3) (3, p2) (5, p4) aaeXOR(A2, a2) bbeXOR(A3, a3) cceXOR(A4, a4) 1. Non-degraded mode (no drive failures) 2. Degraded mode (single drive failure) 3. Double degraded mode 4. Rebuilding mode C. Sample Code for XOR implementation With special PAeXOR (aa, bb, cc, pA) XOR commands 1. Non-degraded mode (no drive failures) 2. Degraded mode (single drive failure) 1O 3. Double degraded mode 4. Rebuilding mode A. Sample Code for Non-XOR implementation 1. Non-degraded mode (no drive failures) P4+XOR (cc, p4) WR(1, P3) (3, P2 ) (5, P4) 15 Read data. The array is fully striped, as in RAID 5. RD(1,a2) \\Reads one stripe unit RD(1,b5)RD(2,b6) \\Reads tWo stripe units Perform an Update Write (small Writes) using the folloW ing steps: (1) Read stripe unit to be Written; (2) read old roW and column parities; (3) Write out neW stripe unit; (4) XOR With old/neW data and parities to calculate update XOR; and (5) reWrite parities. Perform a Very Large Write (includes complete parity arrangement) using the folloWing steps: (1) Write out all //EXample: Write A2 RD(1, a2) (3, p2) (6, pA) data; (2) Perform XOR on all roWs to calculate parity A—F; (3) Write to disk; (4) Perform XOR on all column sets 1—6; and (5) Write to disk. PAeXOR(A2, A3, A4, A5) PBeXOR(B3, B4, B5, B6) PCeXOR(C1, C2, C6, C5) PDeXOR(D1, D2, D6, D3) PEeXOR(E2, E6, E4, E1) PFeXOR(F3, F4, F5, F1) P1eXOR(C1, D1, E1, E1) WR(1, A2) (2, A3) (3, A4) ( 6, PA) P2eXOR(aa, p2) P3eXOR(bb, p3) aaeXOR (A2,a2) PAeXOR (aa, pA) 25 P2eXOR(aa, p2) WR(1, A2) (3, P2) (6, PA) 2. Degraded mode (single drive failure) Perform a Read from failed drive using the folloWing steps: (1) Read all other members of the roW parity set (including the roW parity); and (2) Use XOR to regenerate the missing stripe unit. The column parity set could be used, but the roW stripe is more likely to take advantage of caching or of reads larger than the failed stripe unit. // EXample: Drive 1 failed, read A2 35 RD( 2, a3) (3, a4) (4, a5) (6, pA) a2eXOR (a3, a4, a5, pA) Perform a Write to failed drive using the folloWing steps: WR(1, A2, B5, P3, D1, E4, P6) WR(2, A3, B6, PC, D2,E1, PF) WR(3, A4, P2,C1, D6, P5, F3) WR(4, A5, PB,C2, D3, PE, F4) WR(5, P1, B3, C6, P4, E2, E5) WR(6, PA, B4,C5, PD, E6, E1) Perform a Stripe Write (includes complete roW) using the folloWing steps: (1) Read all data stripe units from roW; (2) (1) Regenerate striped unit as above; and (2) folloW update Write procedures. // EXample: Drive 1 failed, Write A2 45 Perform XOR on all neW data in roW; (3) Write to roW parity; (4) Write out neW data stripe units; (5) Read in column To perform a Write to drive With failed parity unit, perform a Write as normal but ignore the failed parity unit. parities; (6) perform update XOR; and (7) reWrite column parity. // EXample: Drive 6 failed, Write A2 // EXample—Write A2—A5. 
RD(1,a2) (2,a3) (3,a4) (4,a5) RD(I, P3) (3, P2, P5) (5, P4) PAeXOR(A2, A3, A4, A5) WR (1, A2) (2, A3) (3, A4) (4, A5)(6, PA) P2eXOR(a2, A2, p2) P3eXOR(a3, A3, p3) P4eXOR(a4, A4, p4) P5eXOR(a5, A5, p5) WR (1, P3) (3, P2, P5) (5, P4) 55 out neW stripe units; (4) XOR With old/neW data and parities to calculate update XOR; and (5) reWrite all parities. RD(1, a2) (3, p2) P2eXOR(a2, A2, p2) WR(1, A2) (3, P2) 3. Double degraded mode, tWo-drive failure Perform a Write to failed drive With failed parity drive using the folloWing steps: (1) Read the remaining parity set; (2) Recalculate the parity set from available stripe units; and (3) Write out the parity unit ignoring the failed parity drive. // EXample: Drives 1 and 6 failed, Write to A2 RD (2, d2) (4, c2) (5, e2) Perform a Partial Stripe Write (not a complete roW) using the folloWing, steps: (1) Read all stripe units to be overWrit ten; (2) Read stripe parity and all column parities; (3) Write RD( 2, a3) (3, a4, p2) (4, a5) (6, pA) a2eXOR(a3, a4, a5, pA) aaeXOR(a2, A2) PA<XOR(aa, pA) P2eXOR(aa, p2) WR(3, P2) (6, PA) WR(3, P2) 65 To Write to an intact drive With both parity drives gone, perform a normal Write. // EXample: Drives 3 and 6 failed, Write A2 US 6,353,895 B1 17 18 WR(1, A2) To Read from a failed drive With one parity drive doWn, TABLE 22 reconstruct any stripe units needed for remaining parity set. // Example one: Drives 1 & 6 failed, read A2 // In this example, all other stripe units in the parity set are available RD (2, d2) (3, p2) (4, c2),(5, e2) A2eXOR(d2, c2, e2, p2) // Example tWo: Drives 1 and 3 failed, read A2 // In this example the ‘A’ set must be used to reconstruct. The Worst case Would be repeated requests for A3 With // HoWever, A4 is not available (it is on drive 3). // A4 must be reconstructed from the ‘4’ set, Which requires E4 (it is on drive 1) 15 // E4 must be reconstructed from the E set, Which is otherWise intact culations Without requiring repeated reads. 4. Rebuilding mode In rebuilding mode, data is regenerated as if a drive has failed, and then Written directly to the drive. Parity units on the rebuilding drive are generated by XOR of all its set members. RD(4, f4) (5, p4) (6, b4) B. Sample Code for XOR implementation RD (2, a3) (4, a5) (6, pA) 25 To Read from failed drive With other failed drive having 1. Non-degraded mode (no drive failures) // Example: Drives 1 & 5 failed, read A2 To Read (any siZe), use non-XOR implementation above. // Drive 5 has set 2 member E2, but no members ofA To perform a Very Large Write (including complete parity RD( 2, a3) (3, a4) (4, a5) (6, pA) a2eXOR(a3, a4, a5, pA) To Read from failed drive With other failed drive having a member of both its parity sets, regenerate a missing parity 35 // Example: Drives 1 & 2 failed, read A2 // Drive 2 has ‘A’ member A3, and ‘2’ member D2. // The 3 set cannot be regenerated (‘3’ parity unit is on arrangement), use the non-XOR implementation above. To perform a Full Stripe Write (a complete roW) use the folloWing steps: (1) Use the data to calculate the neW parity roW; (2) Use XDWrite commands to Write the data to disk; (3) use XDReads to get the XOR data from the XOR buffers; (4) Write out the parity unit for the roW; and (5) use XPWrites to update the parity units for the column parities. // Example: Write A2—A5 (full stripe) XW(1, A2) (2, A3) (3, A4) (4, A5) drive 1), PAeXOR(A2, A3, A4, A5) // so D2 must be regenerated. // The D roW is also missing D1, so D1 must be regen erated from parity set 1. 
// E1 is on drive 2, it must be regenerated from parity set This section illustrates hoW the XOR commands (excluding special XOR commands) can be used to expedite the XOR parity implementation. a member of one parity set, use the other set to regenerate. set member to regenerate read. tiple regenerations are required, the implementation could read the entire extent, and then perform all necessary cal RD (2, e 1) (4, pE) (5, e2) (6, e6) e4eXOR(e1, e2, e6, pE) a2eXOR (a3, a4, a5, pA) drives 1 and 2 out, but most accesses have much more modest reconstruction requirements. In a case Where mul 45 E. WR(6, A5) aaQXP(3, P2) bbQXP(1, P3) ccQXP(5, P4) // E4 is on drive 1, it must be regenerated from parity set 4, Which is intact. RD (3, a4) (4, f4) (5, p4) (6, b4) RD (4, pE)(5, e2)(6, e6) eleXOR (e2, e4, e6, pE) RD(3, c1) (5, pl) (6, f1) d1eXOR(c1, e 1, f 1, p1) 55 To perform a Partial Stripe Write (not a complete roW) use the folloWing steps: (1) Use XDWrites to Write the neW data to the disks; (2) use XDReads to get the contents of the XOR buffer; (3) use XPWrite to the appropriate column parities; RD(3, d6)(4, d3)(6, pD) (4) XOR all of the XOR buffers to get a buffer for the roW d2eXOR(d 1, d3, d6, pD) RD (3, p2) (4, c2) (5, e2) a2eXOR(c2, d2, e2, p2) parity; and (5) XPWrite that buffer to the roW parity. // Example: Write A2—A4 (3/4 of a stripe) The depth of the multiple levels of regeneration (resolvable dependencies) depends, to some extent, on the arrangement. Table 22 shoWs the regeneration requirements of a tWo-drive failure, With the ?rst failed drive being number 1, the second drive listed in the horiZontal axis, and the number of regenerations required in the vertical axis. The number of data stripe units in the array is 24. 65 ddeXOR (aa, bb, cc) eaaQXP (3, P2) bbQXP( 1, P3) US 6,353,895 B1 19 20 2. Degraded mode (single drive failure) ccQXP(5, P4) ddQXP(6, PA) To Read from a failed drive, send regenerate command to parity drive for the desired stripe. // Example: Drive 1 failed, read A2 To perform an Update Write (small Writes) use the fol lowing steps: (1) Use XDWrite to Write the neW data to the block; (2) Use XDRead to get the results of the XOR buffer; and (3) Use XPWrite to Write the update information to both RG(6, pA, [2, a3], [3, a4], [4, a5]) parities. To Write to a failed drive, regenerate block as above, then // Example: Write A2 XW(1, A2) folloW update Write procedures. 10 // Example: Drive 1 failed, Write A2 RG(6, pA, [2, a3], [3, a4], [4, a5]) aaQXP(3, P2) aaQXP(6, PA) aaeXOR(a2, A2) 2. Degraded mode (single drive failure) To Read from a failed drive, use the non-XOR procedure above. 15 unit. To Write to failed drive, Regenerate stripe unit, then // Example: Drive 6 failed, Write to A2 folloW update Write procedures. XX (1, A2, [3, P2]) // Example: Drive 1 failed, Write A2 RD (2, a3) (3, a4) (4, a5) (6, pA) a2eXOR(a3, a4, a5, pA) aaeXOR(a2, A2) aaQXP(3, P2) (6, PA) To Write to drive With failed parity unit, ignore the failed aaQXP (3, P2) (6, PA) To Write to a failed parity drive, ignore the failed parity 3. Double degraded mode To Write to failed drive With failed parity drive, if the remaining parity set has no members on the failed parity drive, use rebuild procedure above; otherWise, regenerate enough old data to perform an update operation on remain 25 ing parity drive. // Example one: Drives 1 and 6 failed, Write to A2 parity unit. 
2. Degraded mode (single drive failure)

To Read from a failed drive, use the non-XOR procedure above.

To Write to a failed drive, regenerate the stripe unit, then follow the update Write procedures.

// Example: Drive 1 failed, write A2
RD(2, a3) (3, a4) (4, a5) (6, pA)
a2←XOR(a3, a4, a5, pA)
aa←XOR(a2, A2)
aa→XP(3, P2) (6, PA)

To Write to a drive with a failed parity unit, ignore the failed parity unit.

// Example: Drive 6 failed, write A2
XW(1, A2)
aa→XP(3, P2)

3. Double degraded mode

Without third-party commands, these procedures will all be performed in much the same way as the non-XOR procedure above.

4. Rebuilding mode

To rebuild, use the non-XOR procedures above.

C. Sample Code for XOR with third party commands

1. Non-degraded mode (no drive failures)

To Read (any size), use the same procedure as the non-XOR commands above.

To perform a Very Large Write (including a complete parity arrangement), use the non-XOR commands above.

To perform a Stripe Write (including a complete row), use the following steps: (1) Perform an XOR on all new data blocks in the row; (2) Write to the stripe parity; and (3) Write out the new blocks using XDWriteExtended to update the column parity.

// Example: Write A2-A5
XX(1, A2, [3, P2]) (2, A3, [1, P3]) (3, A4, [5, P4]) (4, A5, [3, P5])
WR(6, PA)

To perform a Partial Stripe Write (not a complete row), use the following steps: (1) Read all row stripe units not overwritten; (2) Calculate the row parity with the old and new stripe units; (3) Write to the row parity; and (4) Write out the new stripe units using XDWriteExtended to update the column parity.

// Example: Write A2-A4 (3/4 of a stripe)
RD(4, a5)
PA←XOR(A2, A3, A4, a5)
XX(1, A2, [3, P2]) (2, A3, [1, P3]) (3, A4, [5, P4])
WR(6, PA)

To Update Write (small writes), use the XOR procedure above.

2. Degraded mode (single drive failure)

To Read from a failed drive, send a regenerate command to the parity drive for the desired stripe.

// Example: Drive 1 failed, read A2
RG(6, pA, [2, a3], [3, a4], [4, a5])

To Write to a failed drive, regenerate the block as above, then follow the update Write procedures.

// Example: Drive 1 failed, write A2
RG(6, pA, [2, a3], [3, a4], [4, a5])
aa←XOR(a2, A2)
aa→XP(3, P2)
aa→XP(6, PA)

To Write to a failed parity drive, ignore the failed parity unit.

// Example: Drive 6 failed, write to A2
XX(1, A2, [3, P2])

3. Double degraded mode

To Write to a failed drive with a failed parity drive: if the remaining parity set has no members on the failed parity drive, use the rebuild procedure above; otherwise, regenerate enough old data to perform an update operation on the remaining parity drive.

// Example one: Drives 1 and 6 failed, write to A2
A2→RB(3, P2, [2, d2], [4, c2], [5, e2])

// Example two: Drives 1 and 3 failed, write A2
// Regenerate E4
RG(4, pE, [2, e1], [5, e2], [6, e6])
// Regenerate A4
e4→RG(5, P4, [4, f4], [6, b4])
// Update operation
aa←XOR(A2, a4)
aa→RB(6, PA, [2, a3], [4, a5])

To Write to an intact drive with both parity drives gone, simply perform the Write.

// Example: Drives 3 and 6 failed, write A2
WR(1, A2)

To Read from a failed drive with one parity drive failed, reconstruct any stripe units needed for the remaining parity set.

// Example one: Drives 1 and 6 failed, read A2
// In this example, all other stripe units in the parity set are available
RG(3, p2, [2, d2], [4, c2], [5, e2])

// Example two: Drives 1 and 3 failed, read A2
// In this example the 'A' set must be used to reconstruct.
// However, A4 is not available (it's on drive 3).
// A4 must be reconstructed from the '4' set, which requires E4 (it's on drive 1).
// E4 must be reconstructed from the E set, which is otherwise intact.
// Regenerate E4
RG(4, pE, [2, e1], [5, e2], [6, e6])
e4←XR(4)
// Regenerate A4
// Regenerate A2
a4→RG(6, pA, [2, a3], [4, a5])

To Read from a failed drive when the other failed drive holds a member of only one of its parity sets, use the other set to regenerate.

// Example: Drives 1 and 5 failed, read A2
// Drive 5 has set 2 member E2, but no members of the A set.
RG(6, pA, [2, a3], [3, a4], [4, a5])
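A simple way to picture the third-party Regenerate (RG) and Rebuild (RB) notation used in these examples is given below. The Python functions are an assumed, simplified model of the command semantics (the parity-holding drive XORs its parity unit with blocks pulled from the listed drives, and rebuild additionally commits the result); they are not the drives' actual command interface.

from functools import reduce

def xor(blocks):
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

# drives: drive number -> {block name -> bytes} (an in-memory stand-in for the array)

def regenerate(drives, parity_drive, parity_block, sources):
    """Model of RG(parity_drive, parity_block, [drive, block], ...): the drive
    holding the parity unit XORs it with blocks pulled from the listed drives
    and returns the regenerated stripe unit."""
    pieces = [drives[parity_drive][parity_block]]
    pieces += [drives[d][b] for d, b in sources]
    return xor(pieces)

def rebuild(drives, target_drive, target_block, sources, seed=None):
    """Model of RB(target_drive, target_block, [drive, block], ...): like
    regenerate, but the result is committed to the target drive; an optional
    seed buffer (for example, new data pushed by the controller) is folded in."""
    pieces = [drives[d][b] for d, b in sources]
    if seed is not None:
        pieces.append(seed)
    drives[target_drive][target_block] = xor(pieces)
    return drives[target_drive][target_block]

# Usage in the notation above, e.g. drive 1 failed, read A2:
#   a2 = regenerate(drives, 6, "pA", [(2, "a3"), (3, "a4"), (4, "a5")])
# and drives 1 and 6 failed, write A2 (example one):
#   rebuild(drives, 3, "P2", [(2, "d2"), (4, "c2"), (5, "e2")], seed=new_a2)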
To Read from a failed drive when the other failed drive holds a member of both of its parity sets, regenerate a missing parity set member to complete the read.

// Example: Drives 1 and 2 failed, read A2
// Drive 2 has 'A' member A3, and '2' member D2.
// The 3 set cannot be regenerated (the '3' parity unit is on drive 1),
// so D2 must be regenerated.
// The D row is also missing D1, so D1 must be regenerated from parity set 1.
// E1 is on drive 2, it can be regenerated from parity set E.
// E4 is on drive 1, it can be regenerated from parity set 4, which is intact.

In some situations, it is more efficient (i.e., faster) to read the entire stripe map and manually reconstruct the data, at least for a stripe unit with so many dependencies.

4. Rebuilding mode

The rebuilding command can expedite rebuilding, especially when recovering from a single drive failure. The rebuild command is used to reconstruct a stripe unit and commit it to disk.

RB(1, A2, [2, a3], [3, a4], [4, a5], [6, pA])

Although the foregoing invention has been described in some detail for purposes of clarity of understanding, it will be apparent that certain changes and modifications may be practiced by persons skilled in the art within the scope of the appended claims. Accordingly, the present embodiments are to be considered as illustrative and not restrictive, and the invention is not to be limited to the details given herein, but may be modified within the scope and equivalents of the appended claims.

What is claimed is:

1. A redundant array of independent disk drives wherein a two-dimensional XOR parity arrangement provides two-drive fault tolerance, comprising:
a disk array comprising a first disk drive and a plurality of disk drives; and
an array controller operatively coupled to the disk array, said array controller is configured to calculate row XOR parity sets and column XOR parity sets, said array controller is further configured to distribute said row XOR parity sets and said column XOR parity sets across said disk array by arranging said row and column XOR parity sets such that a first data block on said first disk drive exists in a first row parity set and a first column parity set, and wherein no other data block on any of said plurality of disk drives exists in both said first row parity set and said first column parity set.

2. The redundant array of independent disk drives of claim 1, wherein said first data block is a stripe unit.

3. The redundant array of independent disk drives of claim 1, wherein said array controller is configured to recover data lost due to a failure of any one disk drive in said disk array.

4. The redundant array of independent disk drives of claim 1, wherein said controller is configured to recover data lost due to a failure of any two disk drives in said disk array.

5. The redundant array of independent disk drives of claim 1, wherein said controller is further configured to reduce reconstruction interdependencies between said first disk block and a second disk block.

6. A method for providing two-drive fault tolerance in a redundant array of independent disk drives, comprising the steps of:
organizing a plurality of disk drives into a plurality of stripes, each of the plurality of stripes comprises a plurality of stripe units, wherein each stripe unit is located on a single disk drive; and
arranging said stripe units into a plurality of XOR parity sets, each of said plurality of XOR parity sets comprises a plurality of stripe units as members, said plurality of XOR parity sets comprises a plurality of row parity sets and a plurality of column parity sets such that each stripe unit exists in a parity set pair, said parity set pair comprising a row parity set and a column parity set, and wherein no two stripe units exist in the same parity set pair.

7. The method of claim 6, wherein each of said XOR parity sets comprises one or more data members and one or more parity members, said parity members calculated as an exclusive-or of said data members.

8. The method of claim 7, wherein each of said disk drives comprises data members and parity members.

9. The method of claim 6, wherein each of said parity members exists in only one XOR parity set.

10. The method of claim 6, further comprising the step of analyzing said arrangement to detect cyclic dependencies.
11. A redundant array of independent disk drives configured to recover data lost due to a failure of any two disk drives in the array, comprising:
a disk array comprising a first disk drive and a plurality of disk drives; and
an array controller operatively coupled to said disk array, said array controller configured to calculate row XOR parity sets and column XOR parity sets, said array controller is further configured to distribute said row XOR parity sets and column XOR parity sets across said disk array to reduce reconstruction interdependencies for data reconstructed from said row parity sets and said column parity sets, said interdependencies being reduced by arranging said row and column parity sets such that a first stripe on said first disk drive exists in a first row parity set and a first column parity set, and wherein no other stripe on any of said plurality of disk drives exists in both said first row parity set and said first column parity set.

* * * * *