Vectorization of Reed Solomon Decoding and Mapping on the EVP

Akash Kumar¹, Kees van Berkel¹,²
¹ Eindhoven University of Technology, Eindhoven, The Netherlands
² NXP Semiconductors, Eindhoven, The Netherlands
[email protected]

Abstract

Reed Solomon (RS) codes are used in a variety of (wireless) communication systems. Although commonly implemented in dedicated hardware, this paper explores the mapping of high-throughput RS decoding on vector DSPs. The four modules of such a decoder, viz. Syndrome Computation, Key Equation Solver, Chien Search, and Forney, pose different vectorization challenges. Their vectorizations are explained in detail, including optimizations specific to the Embedded Vector Processor (EVP). For RS(255, 239), this solution is benchmarked against published implementations, and scalability up to vector size 64 is explored. The best-case and worst-case throughputs of our implementation are about 8 times and 2 times higher, respectively, than those of other programmable architectures.

1. Introduction

Wireless radio standards (for cellular, broadcast, connectivity, and positioning) are proliferating rapidly and evolving continuously. Accordingly, the trend in mobile handsets is toward multi-standard architectures that support a range of radio standards through SW configurability, programmability, or a combination of the two.

Most digital wireless standards require some form of error correction, based, for example, on Reed Solomon (RS) codes. Allocating fixed hardware resources to an RS decoder would be wasteful for standards that do not use it. In such cases it would be more efficient to map the decoder function on a more generic, programmable DSP. However, with a few dozen operations per output sample, and sample rates of 10-100 MHz, RS decoding is computationally intensive.

Wide SIMD machines, also known as vector processors, offer the required compute power. However, the available parallelism is constrained to SIMD operations. So-called vectorization, in computer science, is the process of converting a computer program from a scalar implementation, which performs one operation on a pair of operands at a time, to a vectorized program in which a single instruction performs multiple operations on a pair of vector operands (series of values adjacent in memory).

In this paper, we discuss the vectorization of RS decoding, with special attention to scalability in vector sizes up to 64. RS codes and the various modules of RS decoders are introduced in Section 2. The EVP is introduced (Section 3) as an advanced example of a vector processor, and some EVP-specific optimizations are suggested. It turns out that the different decoder modules require different approaches to vectorization (Section 4). In the results section (Section 5), our implementation is benchmarked against other published implementations.

2. Reed Solomon Codes

Reed Solomon codes are a subset of Bose-Chaudhuri-Hocquenghem (BCH) codes and are linear block codes [1]. A Reed Solomon code is specified as RS(n, k) with s-bit symbols. This implies that for every k input symbols, n − k parity symbols are added to make an n-symbol code word. Such a code word is able to correct errors in up to t symbols, where 2t = n − k. An RS(255, 239) code adds 16 parity symbols for every 239 input symbols to make 255 symbols overall of 8 bits each, and is able to correct errors in 8 symbols. Figure 1 shows an example of a systematic RS code word. In a systematic code word, the input symbols are left unchanged and the parity symbols are appended to them.

Figure 1. A typical RS code word: k data symbols followed by 2t parity symbols, n symbols in total.

Noise in the transmission channel often introduces errors in the code word. Reed-Solomon codes are particularly well suited to correcting burst errors, i.e. where a series of bits in the code word is received in error. If r(x) is the received code word, then

r(x) = c(x) + e(x)    (1)

where c(x) is the original code word and e(x) is the error introduced in the channel. The aim of the decoder is to find e(x) and then subtract it from r(x) to recover the original code word. It should be added that there are two aspects of decoding: error detection and error correction. In an RS code, errors can only be corrected if there are at most t of them. More than t errors may be detected, but it is not possible to correct them.
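To make the RS(255, 239) parameters used throughout this paper concrete, the following quantities follow directly from the definitions above:

n − k = 255 − 239 = 16 parity symbols, so t = 16/2 = 8 correctable symbol errors;
one code word spans 255 × 8 = 2040 bits, at a code rate of 239/255 ≈ 0.94.

The 2040-bit code word length is also the figure underlying the throughput numbers in Section 5.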
2.1. Decoder Structure

A detailed explanation of Reed Solomon decoders can be found in [1] and [2]. The decoder consists of four modules, as shown in Figure 2. The first module computes the syndrome polynomial from the code word. The syndromes are used to construct an equation which is solved in the next module. This key equation solver module generates two polynomials for determining the location and value of errors in the received code word, if any. The next module, called Chien Search, computes the error locations, while the fourth module employs the Forney algorithm to determine the values of the errors detected. A good summary of the hardware resources needed by different algorithms for each module in an ASIC implementation is provided in [3]. Interestingly, the hardware required is proportional to the error correction capability of the code and not to the actual code word length [3].

Figure 2. RS decoder: Syndrome Computation, Key Equation Solver, Chien Search, and Forney Algorithm in sequence, with a FIFO buffering the received data between input and output.

3. Embedded Vector Processor

The EVP16 (Embedded Vector Processor [4]) is a productized version of the CVP [5]. Although originally developed to support 3G standards, the current architecture proves to be more versatile. The architecture combines SIMD and VLIW parallelism, as depicted in Figure 3. The main word width is 16 bits, with support for 8-bit and 32-bit data. The EVP16 supports multiple data types, including complex numbers. For example, a complex vector multiplication uses 16 multipliers to multiply 8 complex numbers in two clock cycles. The EVP16 has 16 vector and 32 scalar registers. The maximum VLIW parallelism available equals five vector operations plus four scalar operations plus three address updates plus loop control.

Figure 3. The EVP16 architecture: a VLIW controller with program memory and ACUs, a 16-word-wide vector memory and vector register file, a 1-word-wide scalar register file (32 entries), and load/store, ALU, MAC/shift, shuffle, intra-vector, code-generation, and AXU units.

In a 90 nm CMOS process, the EVP16 core measures about 3 mm² (600 k gates), runs at 300 MHz (worst-case commercial), and dissipates about 1 mW/MHz including a typical memory configuration.

Programs are written in EVP-C, a superset of ANSI-C. The extensions include vector data types and function intrinsics for vector operations, all in a C-like syntax. The EVP-C compiler takes care of register allocation (scalar and vector registers) as well as VLIW instruction scheduling (scalar and vector operations combined).

4. Vectorization of the RS Modules

Each of the following sub-sections discusses the design decisions made when mapping the RS decoder onto the EVP. The vectorization technique is different for each module and is determined by how the parallelism of the EVP can best be exploited. Scalability is also discussed for each module. The vector processor used here operates on vectors of 256 bits, divided into 32 8-bit data elements.

4.1. Syndrome Computation

Syndrome computation is the first stage of RS decoding. 2t syndrome coefficients are needed for decoding a code word with error correction capability t. The coefficients s_i, with i = 1, 2, ..., 2t, are computed according to the following equation:

s_i = Σ_{j=0}^{N−1} r_j (α^i)^j    (2)

where r_j denotes the N received symbols and α is a root of the primitive polynomial [2].
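As a concrete scalar reference for Equation 2, the plain-C sketch below evaluates all 2t syndromes by direct evaluation. It is an illustration only, not the EVP-C code; the reduction polynomial 0x11d (for which α = 0x02 is primitive, a common choice for RS(255, 239)) is an assumption here, since the actual primitive polynomial is defined by the application standard.

#include <stdint.h>

#define N   255                    /* code word length (symbols)           */
#define T2   16                    /* number of syndromes, 2t              */

/* GF(2^8) multiply by shift-and-XOR; assumed reduction polynomial
   x^8 + x^4 + x^3 + x^2 + 1 (0x11d), so that alpha = 0x02 is primitive.   */
static uint8_t gf_mul(uint8_t a, uint8_t b) {
    uint8_t p = 0;
    for (int i = 0; i < 8; i++) {
        if (b & 1) p ^= a;         /* conditionally add a (GF add = XOR)   */
        uint8_t msb = a & 0x80;
        a <<= 1;
        if (msb) a ^= 0x1d;        /* reduce on overflow of x^8            */
        b >>= 1;
    }
    return p;
}

/* Direct evaluation of Equation 2: s_i = sum_j r_j * (alpha^i)^j.         */
void syndromes_direct(const uint8_t r[N], uint8_t s[T2]) {
    for (int i = 1; i <= T2; i++) {
        uint8_t ai = 1;                     /* ai = alpha^i                */
        for (int k = 0; k < i; k++) ai = gf_mul(ai, 0x02);
        uint8_t x = 1, acc = 0;             /* x steps through (alpha^i)^j */
        for (int j = 0; j < N; j++) {
            acc ^= gf_mul(r[j], x);
            x = gf_mul(x, ai);
        }
        s[i - 1] = acc;
    }
}

Horner's rule, introduced next, removes the running power x at the cost of processing the symbols in reverse order.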
In order to reduce the computational requirements and complexity, syndromes are often computed using Horner's rule, according to the following recursive formula:

s_{i,j} = s_{i,j−1} α^i + r_{N−j},    j = 1, 2, ..., N    (3)

where s_{i,0} is set to 0 for each code word at the start of the algorithm. At the end of the algorithm, s_{i,N} contains the required coefficient. Figure 4 shows an example of a syndrome computation cell implementing this equation. The procedure to compute syndromes with this formula is presented in Algorithm 1.

Figure 4. A typical syndrome computation cell: a register accumulating the incoming symbol r_n into s_i, with a feedback multiplication by α^i.

Algorithm 1 Syndrome: Serial Computation
1: Read received symbols R[N]
2: for i = 1 to 2 × t do
3:   s_i = 0
4:   for j = 1 to N do
5:     s_i ← s_i × α^i + R[N − j]
6:   end for
7: end for

For the RS(255, 239) code, 16 syndromes need to be computed. The simplest approach is to use a vector to store the 16 8-bit syndrome values and to process one symbol at a time using Equation 3. However, this utilizes only half the vector, and requires 255 iterations. In order to exploit the entire vector, we modify the formula to split each syndrome coefficient into two parts: one for the received symbols at even locations and one for those at odd locations.

s_i = (... (r_{N−1} α^i + r_{N−2}) α^i + ... + r_1) α^i + r_0
    = (... (r_{N−1} α^{2i} + r_{N−3}) α^{2i} + ... + r_1) α^i
      + (... (r_{N−2} α^{2i} + r_{N−4}) α^{2i} + ... + r_0)
    = (s_{i,odd}) α^i + (s_{i,even})

where s_{i,even} is the contribution of all symbols received at even locations, and s_{i,odd} of those received at odd locations. (The code word may be padded with 0's to make N a multiple of the vector size. In RS(255, 239), since N was 255 and the vector size 32, we added one 0 to make N equal to 256.) The two parts are combined at the end of the algorithm. This approach requires only 128 iterations to evaluate the syndromes. Algorithm 2 shows the pseudo-code.

Algorithm 2 Syndrome: Parallel Computation
1: Read received symbols R[N]
2: Set syndrome vector S to 0
3: Set α_high = α_low = ⟨α^2, α^4, ..., α^{4t}⟩
4: for j = 1 to N/2 do
5:   Set R_high = ⟨R[N − 2j]⟩, R_low = ⟨R[N − 2j + 1]⟩
6:   S ← S × α + R   (element-wise, with α = α_high:α_low and R = R_high:R_low)
7: end for
8: Set α_high = ⟨1, 1, ..., 1⟩ and α_low = ⟨α^1, α^2, ..., α^{2t}⟩
9: S ← S × α
10: S2 ← S rotated by 16 symbols
11: S ← S + S2
12: S_high = S_low = required syndrome coefficients
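A scalar emulation of this even/odd split, one syndrome at a time, may help to see what Algorithm 2 computes; on the EVP, the two streams of all 16 coefficients occupy a single 32-element vector, and the loop body is a single vector multiply-add. gf_mul is repeated from the previous sketch, under the same field-polynomial assumption.

#include <stdint.h>

#define N  256                     /* 255 symbols plus one zero pad, r[255] = 0 */
#define T2  16

static uint8_t gf_mul(uint8_t a, uint8_t b) {      /* as before (0x11d)   */
    uint8_t p = 0;
    for (int i = 0; i < 8; i++) {
        if (b & 1) p ^= a;
        uint8_t msb = a & 0x80;
        a <<= 1;
        if (msb) a ^= 0x1d;
        b >>= 1;
    }
    return p;
}

static uint8_t gf_pow(uint8_t a, unsigned e) {     /* a^e by repeated multiply */
    uint8_t r = 1;
    while (e--) r = gf_mul(r, a);
    return r;
}

/* Even/odd split of Horner's rule: both streams step with alpha^(2i);
   finally s_i = s_odd * alpha^i + s_even (cf. Algorithm 2).             */
void syndromes_split(const uint8_t r[N], uint8_t s[T2]) {
    for (int i = 1; i <= T2; i++) {
        uint8_t a2i = gf_pow(0x02, 2u * (unsigned)i);
        uint8_t s_odd = 0, s_even = 0;
        for (int j = 0; j < N / 2; j++) {          /* N/2 = 128 iterations */
            s_odd  = (uint8_t)(gf_mul(s_odd,  a2i) ^ r[N - 1 - 2 * j]);
            s_even = (uint8_t)(gf_mul(s_even, a2i) ^ r[N - 2 - 2 * j]);
        }
        s[i - 1] = (uint8_t)(gf_mul(s_odd, gf_pow(0x02, (unsigned)i)) ^ s_even);
    }
}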
Half Syndromes. It has been mentioned that 2t syndrome coefficients are needed to recover the transmitted symbols in case of errors. However, if any t consecutive syndromes are zero, we may conclude that the code word is either correct or uncorrectable [3]. In practice, many of the code words are received correctly. Therefore, in our design we only compute half the syndromes, i.e. 8 for RS(255, 239), and compute the rest only if any of these eight is non-zero.

In order to exploit the full vector of 32 elements in the EVP when computing only 8 coefficients, we now need four streams in parallel. The equation for these streams is derived as follows:

s_i = (... (r_{N−1} α^i + r_{N−2}) α^i + ... + r_1) α^i + r_0
    = (... (r_{N−1} α^{4i} + r_{N−5}) α^{4i} + ... + r_3) α^{3i}
      + (... (r_{N−2} α^{4i} + r_{N−6}) α^{4i} + ... + r_2) α^{2i}
      + (... (r_{N−3} α^{4i} + r_{N−7}) α^{4i} + ... + r_1) α^i
      + (... (r_{N−4} α^{4i} + r_{N−8}) α^{4i} + ... + r_0)
    = (s_{i,1}) α^{3i} + (s_{i,2}) α^{2i} + (s_{i,3}) α^i + (s_{i,4})

Each vector therefore contains 4 partial values of each syndrome coefficient. Thus, in the case of no errors, only 256/4 = 64 iterations are needed.

Scalability. The algorithm scales very well with the vector size. When the vector size is doubled, the cycle count is almost halved, as shown later in Section 5. For common code word sizes, the scalar overhead is small compared to the loop body, for vector sizes up to 64.

4.2. Key Equation Solver

A number of algorithms and architectures are available for solving the key equation [3]. Berlekamp-Massey (BM) and Euclidean are the most common algorithms, and each has several architecture options for implementation. The algorithm or architecture that is best suited to ASIC design may not be ideal for implementation on a vector processor. The characteristics that make an algorithm suitable for vectorization are:

• Parallelism: this determines to what extent we can exploit the vector length, and hence the benefit of using a vector processor. A serial implementation of the BM algorithm, for example, has very little SIMD parallelism and thus defeats the purpose of using a vector processor.

• Regularity: this leads to shorter code, since loops can be used, and hence lower program memory requirements.

In view of these concerns, the dual-line architecture of the BM algorithm was selected for implementation. Unfortunately, the full parallelism of the vector processor still cannot be exploited, as we only need vectors of size 2 × t = 16 for RS(255, 239), while our vector is of size 32. No known algorithm allows the full capacity of our vector processor to be used here. Figure 5 shows the architecture of the dual-line implementation.

Figure 5. Dual-line architecture for the modified BM algorithm: two parallel register lines (C and D), a discrepancy unit Δ_i, and a controller.

Scalability. This algorithm does not scale very well, since it only allows exploiting parallelism in processors with vector size up to 2 × t. For RS(255, 239), with an error correction capability of 8, a vector of size 16 is enough. The dependency between iterations prohibits further exploitation of the EVP. As we see in Section 5, the cycle count for the Key Equation Solver block remains the same for any parallelism of 16 and above, regardless of the level of parallelism in the processor.
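For reference, below is a minimal plain-C sketch of the textbook serial Berlekamp-Massey recursion, computing the error-locator polynomial Λ(x) from the 2t syndromes. It is the scalar baseline only, not the dual-line formulation of Figure 5 that is actually mapped onto the EVP. gf_mul is the helper from the earlier sketches, and gf_inv here uses a compact (not multiplication-optimal) exponentiation loop; Section 4.4 discusses an 11-multiplication chain.

#include <stdint.h>
#include <string.h>

#define T 8                         /* error correction capability       */

static uint8_t gf_mul(uint8_t a, uint8_t b) {      /* as before (0x11d)  */
    uint8_t p = 0;
    for (int i = 0; i < 8; i++) {
        if (b & 1) p ^= a;
        uint8_t msb = a & 0x80;
        a <<= 1;
        if (msb) a ^= 0x1d;
        b >>= 1;
    }
    return p;
}

static uint8_t gf_inv(uint8_t a) {  /* a^-1 = a^254                      */
    uint8_t r = a;                  /* exponent e = 1                    */
    for (int i = 0; i < 6; i++)
        r = gf_mul(gf_mul(r, r), a);/* e -> 2e + 1: 3, 7, ..., 127       */
    return gf_mul(r, r);            /* 127 -> 254                        */
}

/* Textbook serial Berlekamp-Massey: from syndromes s[0..2t-1], compute
   the error-locator polynomial lambda (degree <= t if correctable) and
   return the number of errors L.                                        */
int berlekamp_massey(const uint8_t s[2 * T], uint8_t lambda[T + 1]) {
    uint8_t C[2 * T] = { 1 };       /* current connection polynomial     */
    uint8_t B[2 * T] = { 1 };       /* copy from the last length change  */
    uint8_t Tmp[2 * T];
    int L = 0, m = 1;
    uint8_t b = 1;

    for (int n = 0; n < 2 * T; n++) {
        uint8_t d = s[n];           /* discrepancy of the next syndrome  */
        for (int i = 1; i <= L; i++)
            d ^= gf_mul(C[i], s[n - i]);
        if (d == 0) {               /* current polynomial still fits     */
            m++;
            continue;
        }
        uint8_t c = gf_mul(d, gf_inv(b));
        if (2 * L <= n) {           /* length change needed              */
            memcpy(Tmp, C, sizeof C);
            for (int i = 0; i + m < 2 * T; i++)
                C[i + m] ^= gf_mul(c, B[i]);
            L = n + 1 - L;
            memcpy(B, Tmp, sizeof B);
            b = d;
            m = 1;
        } else {                    /* update without length change      */
            for (int i = 0; i + m < 2 * T; i++)
                C[i + m] ^= gf_mul(c, B[i]);
            m++;
        }
    }
    memcpy(lambda, C, T + 1);
    return L;
}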
4.3. Chien Search

The Chien Search module is trivial to parallelize and scales well with any level of parallelism available in the processor. This module checks whether there is an error at a given location: the error-locator polynomial is evaluated at the field element corresponding to each position, and a zero marks an error. The various positions can be evaluated in parallel, since there is little dependency between them. There is some overhead compared to a fully scalar design, but it is not significant compared to the gain achieved by vectorization.
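A scalar reference of this search, under the same assumptions as the earlier sketches: position j of the code word is in error exactly when the error-locator polynomial Λ satisfies Λ(α^{-j}) = 0. The direct evaluation below favours clarity over speed; on the EVP, 32 consecutive positions are tested per vector operation.

#include <stdint.h>

#define N 255
#define T 8

static uint8_t gf_mul(uint8_t a, uint8_t b) {      /* as before (0x11d)  */
    uint8_t p = 0;
    for (int i = 0; i < 8; i++) {
        if (b & 1) p ^= a;
        uint8_t msb = a & 0x80;
        a <<= 1;
        if (msb) a ^= 0x1d;
        b >>= 1;
    }
    return p;
}

static uint8_t gf_pow(uint8_t a, unsigned e) {
    uint8_t r = 1;
    while (e--) r = gf_mul(r, a);
    return r;
}

/* Chien search, scalar reference: record every position j for which
   lambda(alpha^-j) == 0, using alpha^-j = alpha^(255-j) as alpha^255 = 1. */
int chien_search(const uint8_t lambda[T + 1], uint8_t err_pos[T]) {
    int n_err = 0;
    for (int j = 0; j < N; j++) {
        uint8_t x = gf_pow(0x02, (255u - (unsigned)j) % 255u);
        uint8_t acc = lambda[T];                   /* Horner evaluation  */
        for (int i = T - 1; i >= 0; i--)
            acc = (uint8_t)(gf_mul(acc, x) ^ lambda[i]);
        if (acc == 0 && n_err < T)
            err_pos[n_err++] = (uint8_t)j;
    }
    return n_err;
}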
4.4. Forney Algorithm

The Forney module is the last stage of the decoder and computes the value of the error at each computed error location. This value is then subtracted from the received symbol to obtain the actual symbol. One difference between this module and the rest of the decoder is the data dependency of the algorithm: it computes values only at locations where an error has actually occurred, and that information becomes available only after the Chien search. Further, this part of the decoder requires a division (or an inverter and a multiplier), which is fairly expensive. There are three ways to implement the divider:

• Hardware: a dedicated divider in hardware is not desirable, since it is only used in this module and is fairly expensive in silicon area.

• Lookup table: another option is to store the inverses in a lookup table and to multiply the dividend by the inverse of the divisor. The size of one lookup table is 256 bytes; a 32-way lookup table requires 8 kB (32 × 256 B), which is also excessive given its limited use.

• Micro-code: since we are working in Galois arithmetic, computing the inverse of an element x, i.e. x^{-1}, is equal to computing x^{N−1}. In our case N = 255, so we need to compute x^{254}, which can be done with Galois multipliers. It takes 11 multiplications to compute the inverse, and therefore 12 multiplications to complete a division. Since we have 255 symbols, and 32 are processed at a time, we only need eight iterations for the entire code word. The number of instructions needed is not very high, and the approach is therefore feasible.

In the final implementation, micro-code is used to implement the divider.

Another decision to be taken in designing this module is whether to compute the error values only at the error locations, or for the entire code word. The former ("selective error computation") requires the extraction of all error locations from the code word, which can hardly be accelerated by means of SIMD instructions. The latter ("oblivious error computation") can be refined by computing error values only for those vectors that contain an error location. The choice between selective and oblivious error computation depends on the capabilities of the vector processor, the code size, and the number of errors. For the benchmarked cases, oblivious error computation is faster.

Scalability. The actual algorithm scales gracefully with the vector length: when the vector length is doubled, the number of instructions is almost exactly halved. However, the choice of algorithm and architecture depends on the vector size. If the vector size is small, a lookup table may be more attractive than micro-code.

4.5. Galois Multiplier

The Galois multiplier forms an integral part of every module in RS decoding, and is therefore discussed here separately. There are various ways to implement a Galois multiplier in the EVP architecture. Some of them are discussed below, together with their trade-offs; a runnable sketch of the software variants follows this list.

• Pure hardware: an 8-bit Galois multiplier requires about 140 gates in CMOS [6]. For 32 multiplications in parallel, we therefore need about 4.5 k gates in total. The total number of gates in the EVP is around 600 k, which puts the cost of parallel Galois multiplication at less than 1% of the total core.

• Lookup table: since we have two 8-bit multiplicands, a lookup table must be indexed by a 16-bit value and store an 8-bit result. This translates to a table of 64 kB for each multiplication. Since we need 32 of these in parallel, the tables total 32 × 64 kB, i.e. 2 MB.

• Hybrid: this approach uses the property that multiplication is equivalent to addition in the logarithmic domain. The logarithms of the multiplicands are looked up in a log table, which is considerably smaller than in the previous case, since we only need to store 8-bit values. The two log values are added, and the result is found by another lookup in the alog table. For Galois arithmetic, the base of both tables is α. Though this takes considerably less memory, it comes at the cost of extra clock cycles: if we use one log table (for each multiplication), we need 4 cycles, and with two log tables we require 3 cycles. In the former case we need 16 kB of memory, while the latter requires 24 kB. Another big advantage of this method is the ease with which division can be implemented: since division is simply subtraction in the log domain, the same trick implements division with the same hardware.

• Pure software: this approach is analogous to using micro-code for complicated instructions. The pseudo-code is shown in Algorithm 3. However, it takes about 40 instructions per multiplication, which renders this implementation infeasible.

Algorithm 3 Galois Multiplication
1: Read 8-bit inputs a and b
2: Set product p to 0
3: for i = 0 to 7 do
4:   if b(0) = 1 then
5:     p ← p ⊕ a
6:   end if
7:   c ← a(7)
8:   Shift a one bit to the left
9:   if c ≠ 0 then
10:    a ← a ⊕ 0x1b
11:   end if
12:   Shift b one bit to the right
13: end for
14: p holds the final product
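The sketch below gives plain-C versions of the software options above: the shift-and-XOR multiply of Algorithm 3, the hybrid log/alog tables, and one possible 11-multiplication chain for the x^{254} inverse used by the Forney module. The reduction polynomial 0x11d (with α = 0x02 primitive) is an assumption for illustration; Algorithm 3 in the text reduces by 0x1b, i.e. polynomial 0x11b, instead.

#include <stdint.h>

/* Shift-and-XOR GF(2^8) multiply (cf. Algorithm 3). */
static uint8_t gf_mul(uint8_t a, uint8_t b) {
    uint8_t p = 0;
    for (int i = 0; i < 8; i++) {
        if (b & 1) p ^= a;
        uint8_t msb = a & 0x80;
        a <<= 1;
        if (msb) a ^= 0x1d;          /* assumed polynomial 0x11d         */
        b >>= 1;
    }
    return p;
}

/* Hybrid log/alog multiply: a 256-byte log table plus a 255-byte alog
   table; a*b = alog[(log a + log b) mod 255], and division is the same
   with a subtraction.                                                   */
static uint8_t gf_log[256], gf_alog[255];

static void gf_init_tables(void) {
    uint8_t x = 1;
    for (int i = 0; i < 255; i++) {
        gf_alog[i] = x;
        gf_log[x] = (uint8_t)i;
        x = gf_mul(x, 0x02);         /* successive powers of alpha       */
    }
}

static uint8_t gf_mul_log(uint8_t a, uint8_t b) {
    if (a == 0 || b == 0) return 0;  /* log(0) is undefined              */
    return gf_alog[(gf_log[a] + gf_log[b]) % 255];
}

/* Inverse as x^254, with one possible 11-multiplication chain
   (squarings count as multiplications):
   x^2 x^3 x^6 x^7 x^14 x^28 x^56 x^63 x^126 x^127 x^254.                */
static uint8_t gf_inv(uint8_t x) {
    uint8_t x2   = gf_mul(x, x);       /*  1 */
    uint8_t x3   = gf_mul(x2, x);      /*  2 */
    uint8_t x6   = gf_mul(x3, x3);     /*  3 */
    uint8_t x7   = gf_mul(x6, x);      /*  4 */
    uint8_t x14  = gf_mul(x7, x7);     /*  5 */
    uint8_t x28  = gf_mul(x14, x14);   /*  6 */
    uint8_t x56  = gf_mul(x28, x28);   /*  7 */
    uint8_t x63  = gf_mul(x56, x7);    /*  8 */
    uint8_t x126 = gf_mul(x63, x63);   /*  9 */
    uint8_t x127 = gf_mul(x126, x);    /* 10 */
    return gf_mul(x127, x127);         /* 11: x^254 = x^-1 */
}

After gf_init_tables(), gf_mul_log(a, b) and a subtraction-based divide replace the bit-serial loop at the cost of the table lookups, which is exactly the cycle/memory trade-off described above.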
5. Results and Benchmarking

5.1. EVP results

For the EVP, a hardware Galois multiplier is assumed to be present unless otherwise mentioned. Since our example uses 8-bit symbols, the supported vector size equals 32. RS(255, 239) was used for benchmarking. Since EVP-C is C-based, programming the RS modules was relatively easy. In all, ten thousand test cases were run, with varying numbers of symbol errors in the code word; the errors were introduced at random positions. We first verified the functional correctness of the decoder by implementing the Galois multiplier with a lookup table, and later used a dummy vector instruction to measure the instruction count of the algorithm. The initial count obtained was on the order of 3000 instructions for decoding one code word of 255 symbols, i.e. 2040 bits. This was optimized in various iterations after discussions with the compiler experts. The optimizations that helped most in reducing the cycle count are listed below.

• Shuffle instruction: the EVP shuffle instruction allows selective copying of content from one vector to another, based on a pattern provided in a third vector.

• Conditional branches: some conditional branches could be replaced by cheaper masked instructions.

• Hardware-loop support: loops with a fixed number of iterations run faster using hardware loop support. The counter type was modified such that the EVP-C compiler recognizes these loops.

• Multiple streams: to exploit the full VLIW parallelism, multiple streams were introduced in the syndrome computation. Using four streams is optimal, since there are four instructions in each loop.

With the above and some other minor optimizations, the instruction count was reduced to about 700 for decoding in the worst case (8 errors). For the best case (no errors), the count was reduced to as low as 80 instructions.

Table 1 shows the number of instructions needed by the different modules. The column labeled 'Constant' gives the number of instructions outside the loop of a module, 'Loop' the number of instructions in the loop, and 'Count' the number of iterations of that loop. In modules with two loops, the first row shows the inner loop and the second row the outer loop, whose per-iteration cost includes the inner loop. The columns 'Scalar' and 'Vector' give the number of instructions of each kind in the module. The column 'Vector (s/w div.)' is an estimate of the vector instructions needed when Galois division is carried out in software, and the last column gives the number of Galois multiplication operations in the entire decoding process. As expected, software division affects only the Forney module, as it is the only one that needs Galois division. We also observe that the majority of operations are vector operations.

Table 1. Cycle count of various modules for the EVP implementation (vector size 32)

Module                Constant  Loop  Count  Total*  Scalar  Vector  Vector      Mult.
                                                                     (s/w div.)
Syndrome, 1st half          12     4     15      72       6      66          66     66
Syndrome, 2nd half          20     4     15      80      12      68          68     68
Key Equation (ELP)          18     6     15     108      18      90          90     32
Key Equation (EEP)           6     3      9      33       6      27          27      9
Key Equation (total)         -     -      -     141      24     117         117     41
Chien, inner loop            7     5      4      27       -       -           -      -
Chien (total)                -    27      8     216       6     210         210    128
Forney, inner loop          10     2      7      24       -       -           -      -
Forney (total)               6    24      8     198       6     192         272    208
Total                        -     -      -     707      54     653         733    511

*Total = Constant + (Loop × Count). ELP: error-locator polynomial; EEP: error-evaluator polynomial.

5.2. Comparison with other implementations

Table 2 shows the cycle counts needed for RS decoding on various architectures. We compare our implementation with two other programmable processors: the TI TMS320C64x DSP [7] and the Starcore [8]. (Reconfigurable architectures have also been used for RS decoding, e.g. Morphosys [9]; however, their area and power consumption often make them unattractive. Morphosys has ten times as much area as the EVP, with technology scaling, and consumes thirteen times as much power.) For the TI DSP, we upscale the published RS(208, 188) results to RS(255, 239) for a fair comparison; Starcore operates on a single symbol at a time and was therefore not scaled. We also estimate how many cycles are needed when the vector parallelism of the EVP is changed to 16 or 64. The second column shows results with a Galois divider in hardware, while the other EVP columns assume that division is done in software.

Table 2. Cycle count of various modules for different architectures

                         EVP, RS(255, 239)                TI DSP                  Starcore
                    H/w div. S/w div. S/w div. S/w div. RS(208,188) RS(255,239) RS(255,239)
Parallelism degree       32       32       16       64        8           8           1
Syndrome, 1st half       72       72      138       39      470         576        5800
Syndrome, 2nd half       80       80      148       51        -           -           -
Key Equation            141      141      141      141      263         263        3816
Chien                   216      216      426      111      326         399        4128
Forney                  198      278      550      142      154         188         590
Total cycles            707      787     1403      484     1213        1426       14334
Worst-case
throughput (Mbps)       860      778      440     1264      411         430          43
Best-case
throughput (Mbps)      8500     8500     4430    15700     1060        1060         105

(TI and Starcore compute all 2t syndromes up front; the EVP computes the second half only when the first half indicates an error.)

The throughput achieved by the various architectures is also shown. The current EVP is capable of running at 300 MHz. Thus, the maximum throughput, obtained when no errors are found, is about 8.5 Gbps, while the worst-case throughput is about 800 Mbps. In comparison, assuming the same clock frequency, TI achieves a best-case throughput of about 1.1 Gbps and a worst-case throughput of 430 Mbps. For Starcore, the best-case and worst-case throughputs are about 105 and 43 Mbps, respectively (Table 2).
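To see how the throughput rows of Table 2 follow from the cycle counts, note that throughput = (code word bits × clock frequency) / (cycles per code word). For the EVP at 300 MHz with software division, the worst case gives 2040 bits × 300 MHz / 787 cycles ≈ 778 Mbps, and the best case, in which only the first half of the syndromes is computed (72 cycles), gives 2040 × 300 / 72 = 8500 Mbps; both match the tabulated values.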
Figure 6 shows an estimate of the mean cycle count for different bit error rates (BER) in the transmission channel. (An analysis of BER versus SNR can be found in [10].) As mentioned earlier, in the case of no errors we only need to compute half the syndromes. Therefore, as the BER decreases below 10^-4, the number of cycles needed drops to about 80 for the EVP and about 600 for TI; at low BER, the mean cycle count corresponds to the number of cycles needed just to detect whether there is any error in the received code word. The lower cycle count of the EVP stems from its greater parallelism and from computing only half the syndromes. With increasing BER, the number of code words with errors increases as well: with 2040-bit code words, the fraction containing at least one bit error is 1 − (1 − BER)^2040, which already exceeds 85% at BER = 10^-3. Hence, when the BER rises above 10^-3, most code words have errors and the cycle count approaches the worst case. In this regime the EVP takes about half as many cycles as TI, rather than a quarter, because scalar instructions limit the exploitation of the full parallelism of the EVP.

Figure 6. Mean cycle count v/s bit error rate, for the EVP with hardware and with software division, and for the TI DSP.

Figure 7 shows the cycle count normalized to the amount of parallelism present in each architecture, for varying BER. The normalized cycle count is obtained by multiplying the cycle count by the vector parallelism of the architecture: TI is multiplied by 8, and the EVP by 16, 32, or 64, depending on the design; Starcore operates on a single symbol and is therefore not scaled. We again observe that at lower BER the mean cycle count is lower. As expected, the number of cycles for the EVP is only half that of the others, since only half the syndromes are computed; and since the syndrome computation module scales gracefully, high parallelism causes no inefficiency in this module. Another observation is that high parallelism is not very efficient overall. This comes from the fact that not all blocks of the algorithm scale ideally with increasing parallelism: the Key Equation Solver in particular scales only up to a parallelism of 16, and significantly decreases the efficiency of architectures with higher parallelism. However, if decoding speed is crucial, higher parallelism does give higher throughput.

Figure 7. Normalized cycle count v/s bit error rate, for the EVP with parallelism 16, 32, and 64, TI with parallelism 8, and Starcore.

6. Conclusions

In this paper, we presented various vectorization techniques for the modules of Reed Solomon decoding. We have seen that different modules need to be vectorized in different ways. The vectorized decoder was implemented on the EVP, and the results were compared with other SIMD implementations of RS decoders. The best-case and worst-case throughputs of our implementation are about 8.5 Gbps and 800 Mbps, respectively; the best-case result is about 8 times higher than that of the other implementations. However, we observe that no single architecture suits all scenarios: the most suitable architecture depends on the decoding speed requirements and on the bit error rate of the transmission channel. Future research will focus on optimizing power.

Acknowledgements

We would like to thank Wim Kloosterhuis and Mahima Smriti from NXP's DSP Innovation Center for their help in optimizing the EVP-C code.

References

[1] S. B. Wicker and V. K. Bhargava, Reed Solomon Codes and Their Applications. Piscataway, NJ: IEEE Press, 1994.
[2] R. E. Blahut, Theory and Practice of Error Control Codes. Addison-Wesley, 1983.
[3] A. Kumar and S. Sawitzki, "High-Throughput and Low-Power Architectures for Reed Solomon Decoder," Asilomar Conference on Signals, Systems and Computers, 2005.
[4] C. H. (Kees) van Berkel et al., "Vector Processing as an Enabler for Software-Defined Radio in Handheld Devices," EURASIP Journal on Applied Signal Processing, 2005.
[5] C. H. (Kees) van Berkel et al., "CVP: A Programmable Co Vector Processor for 3G Mobile Base-band Processing," Proc. World Wireless Congress, May 2003.
[6] C. Paar and M. Rosner, "Comparison of Arithmetic Architectures for Reed-Solomon Decoders in Reconfigurable Hardware," IEEE Symposium on FPGAs for Custom Computing Machines, 1997.
[7] J. Sankaran, "Reed Solomon Decoder: TMS320C64x Implementation," Texas Instruments Inc., Application Report SPRA686, Dec. 2000.
[8] D. Taipale et al., "Reed Solomon Decoding on the StarCore Processor," Motorola Semiconductors Inc., AN1841/D, May 2000.
[9] A. Koohi, N. Bagherzadeh, and C. Pan, "A Fast Parallel Reed-Solomon Decoder on a Reconfigurable Architecture," CODES+ISSS, Oct. 2003.
[10] A. Kumar, "High-throughput Reed-Solomon Decoder for Ultra Wide Band," Master's thesis, 2004, http://www.ics.ele.tue.nl/~akash.