Design and Implementation of Open-Source SATA III Core for Stratix V FPGAs

Sumedh Guha¹, Wen Wang¹, Shafeeq Ibraheem¹, Mahesh Balakrishnan², and Jakub Szefer¹
¹Dept. of Electrical Engineering and ²Dept. of Computer Science, Yale University
{sumedh.guha, wen.wang.ww349, shafeeq.ibraheem, mahesh.balakrishnan, jakub.szefer}@yale.edu

Abstract—SATA is the de-facto standard computer interface that connects a host, typically a computing device, to a persistent storage device, such as a hard drive or solid-state drive. In order for FPGA-based designs to be able to leverage the variety of persistent storage devices, a SATA core is needed. Over time, the SATA standard has been revised to provide greater bandwidth, with SATA III being the newest version of the standard. In this paper, we are the first to present a SATA III core designed for Altera Stratix V FPGAs. Our implementation is written in Verilog and tested using an industry-standard SATA protocol analyzer. We evaluate the performance of our SATA core by measuring the throughput of random and sequential read and write operations using various hard drives and solid-state drives. In addition, we compare the complexity of our SATA III core implementation with that of older SATA I and II open-source implementations, and show that SATA III is feasible using only about 11% of Stratix V FPGA resources.

I. INTRODUCTION

Solid-state drives (SSDs) and hard drives (HDs) offer increasing amounts of cheap, fast, persistent storage, with capacities now routinely on the order of 1 TB or more. To efficiently access such amounts of storage, greater bandwidth is needed between the computing device and the storage device. The de-facto standard for connecting computers with storage devices is the SATA protocol. Widely deployed and supported, SATA is an ideal solution if an FPGA-based design needs to interconnect with a storage device. Over the years, the SATA standard has been updated, and prior FPGA SATA core designs are not usable with the latest SATA III standard.

In this work, we build on an existing SATA I and II open-source design [1] developed for Xilinx devices, and present a new SATA III core design targeting Altera Stratix V FPGAs. The main advantage of the SATA III interface over prior versions is its theoretical maximum bandwidth of 600 MB/s, which is greater than the theoretical maximum bandwidths of the SATA I (150 MB/s) and SATA II (300 MB/s) interfaces. The increased bandwidth is achieved with faster clock rates for the serial SATA link to the disk (3000 MHz) and for the internal SATA III core parallel data path (150 MHz). To support SATA III, the Altera Native PHY is used along with our Physical, Link, and Command Layers, as described in detail in the later sections.

The SATA III core presented makes a number of contributions and improvements:
• The first open-source FPGA SATA core design targeting Altera devices.
• A new SATA physical layer designed to work with Altera Stratix V FPGAs at SATA III speeds, using the Stratix Native PHY IP as the basic building block.
• The link and command layers of the SATA core written in Verilog.
• Better performance with HDs and SSDs, compared to existing SATA I and II designs.

II. SATA III BACKGROUND

Serial ATA (SATA) is a storage interface for connecting computing devices to peripheral devices (e.g., hard drives, solid-state drives, optical drives). SATA is an improvement upon Parallel ATA (PATA), an interface standard of the 1980s. SATA features several advantages over PATA, including improved bandwidth, hot-swapping, greater ease of integration, and lower cost. As a result, SATA has been the standard for storage interfacing since its introduction in 2003. SATA III is one of the most recent revisions of the SATA standard; it runs at a raw 6 Gb/s line rate, or up to a 600 MB/s data rate when accounting for 8b/10b encoding. By use of a SATA III core, today's HDs and SSDs are able to achieve rates well beyond 300 MB/s.
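As a quick check on these figures, using only standard conversions rather than any additional data from the paper: the 600 MB/s rate follows from the 6 Gb/s line rate once the 8b/10b coding overhead is removed, and it matches the 150 MHz parallel data path under the usual assumption of a 32-bit (Dword) datapath width:

\[
6\,\mathrm{Gb/s} \times \tfrac{8}{10} \times \tfrac{1\,\mathrm{byte}}{8\,\mathrm{bits}} = 600\,\mathrm{MB/s},
\qquad
150\,\mathrm{MHz} \times 4\,\mathrm{bytes} = 600\,\mathrm{MB/s}.
\]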
All SATA interface standards, including SATA III, follow the same five-layer architecture, described below.

A. Application Layer

At the Application Layer, the user specifies the operation (e.g., sequential or random access, read or write), the sector address, and the number of sectors involved in the operation. All operations are in terms of one or more 512-byte sectors. Based on the operation, read or write enable signals are sent to trigger the Command Layer. Upon completion of an operation, a status message is read along with data from the device.

B. Command Layer

The Command Layer receives the operation parameters and determines the appropriate sequence of Frame Information Structures (FISes) to be sent. A finite-state machine (FSM) accounts for the transitions between the different FIS transmissions. The FIS types include Register Transfer Host to Device, Register Transfer Device to Host, DMA Activate, and Data. Once it has decided on the FIS type to send, the Command Layer passes the FIS information to the Transport Layer.

C. Transport Layer

The Transport Layer constructs and deconstructs the FISes sent to and received from the Command Layer. The Transport Layer follows the ATA protocol for preparing the FIS payload for transmission and for extracting status information from received packets. If there are any errors, the Transport Layer requests a retransmission.

D. Link Layer

The Link Layer takes care of framing and delivering each FIS. It uses primitives to mark the boundaries of the FIS; primitives are also used for managing handshaking between the FPGA and the device. The Link Layer computes a Cyclic Redundancy Check (CRC) over the data and appends the CRC to the end of the payload when sending FISes. Finally, the Link Layer scrambles the frame information by XORing it with the output of a linear feedback shift register (LFSR). Scrambling is performed to reduce electromagnetic interference that could corrupt the frames. The Link Layer uses the same CRC and scrambling mechanisms to check for errors in, and to descramble, incoming packets. If there are any errors, the Link Layer signals the Transport Layer to request a retransmission. The Link Layer also regulates frame transmission by checking for FIS buffer underflow and overflow.
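To make the scrambling step concrete, below is a minimal Verilog sketch of a SATA-style scrambler LFSR. It is illustrative rather than the core's actual code: the generator polynomial x^16 + x^15 + x^13 + x^4 + 1 and the per-frame seed follow the SATA specification, while the one-bit-per-cycle serial form shown here is a simplification of the parallelized logic a real Link Layer needs in order to cover a full 32-bit Dword every clock.

module scrambler_lfsr (
    input  wire clk,
    input  wire rst,      // reload the per-frame seed value
    input  wire enable,   // advance the LFSR by one bit
    output wire scr_bit   // scrambler bit, XORed with the payload bit stream
);
    // 16-bit LFSR state for G(x) = x^16 + x^15 + x^13 + x^4 + 1
    reg [15:0] lfsr;

    assign scr_bit = lfsr[15];

    always @(posedge clk) begin
        if (rst)
            // seed value defined by the SATA specification for the start of a frame
            lfsr <= 16'hF0F6;
        else if (enable)
            // Fibonacci-form feedback: taps at polynomial exponents 16, 15, 13, and 4
            lfsr <= {lfsr[14:0], lfsr[15] ^ lfsr[14] ^ lfsr[12] ^ lfsr[3]};
    end
endmodule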
E. Physical Layer

The Physical Layer serializes and deserializes frames exchanged with the Link Layer and encodes/decodes them using the 8b/10b scheme before transmission. The Physical Layer is also involved in Out-of-Band (OOB) signaling, which is used to establish the physical link and to negotiate the data transmission speed.

III. RELATED WORK

Researchers in the high-performance computing field have taken an interest in combining FPGAs with nonvolatile storage devices. As an industry standard, SATA provides the perfect interface for accessing storage devices. Currently, there exist several implementations of SATA for FPGA devices.

One of the first open-source SATA implementations was Groundhog [3], a SATA host bus adapter (HBA) for Xilinx Virtex-5 FPGAs. Groundhog also supports native command queueing, an optimization for read and write command ordering introduced with SATA II. Later, an open-source SATA II core was developed at the University of North Carolina at Charlotte (UNC) [2]. This SATA core was designed for Virtex-6 devices and the ML605 board. The UNC group also added a DMA engine, a bus interface, and a Linux block device driver to make the core available to the operating system. A follow-on work by the University of Massachusetts Amherst (UMASS) built off of the UNC design to create a SATA core for Virtex-4 devices, running at both SATA I and SATA II speeds. The UMASS core also features a replay buffer for retransmitting data FISes, a SATA Event Logger for transferring debugging information, and a new physical layer for handling the RocketIO MGT on the Virtex-4 [1].

Among commercial vendors, IntelliProp [5] and Design Gateway [4] are some of the vendors offering SATA designs. The commercial designs have resource usage slightly better than the open-source designs and are typically available for Xilinx and Altera devices. The major downsides of using closed-source designs for research projects are the high license costs for the IP cores and the lack of flexibility to adapt the code.

In our work, we design and implement the first open-source SATA III core compatible with Altera's Stratix V FPGAs, targeting the DE5-Net board. Table I presents details of the prior SATA cores and is used to compare them with our work. Details of our design and implementation are presented next.

TABLE I
COMPARISON OF RESOURCE CONSUMPTION OF EXISTING FPGA IMPLEMENTATIONS OF SATA, ALONG WITH OUR NEW DESIGN.

Design             | Open-Source | SATA II | SATA III | Brand  | Model     | Slices  | BRAM   | LUT     | F/Fs
UMASS [1]          | X           | X       |          | Xilinx | Virtex-4  | 5128    | 7      | -(a)    | -(a)
UNC [2]            | X           | X       |          | Xilinx | Virtex-5  | 576     | 3      | 1282    | 986
Groundhog [3]      | X           | X       |          | Xilinx | Virtex-5  | 652     | 0      | 1537    | 763
UNC [2]            | X           | X       |          | Xilinx | Virtex-6  | 570     | 3      | 1334    | 894
Design Gateway [4] |             |         | X        | Xilinx | Virtex-7  | 476     | 2      | 1024    | 863
IntelliProp [5]    |             |         | X        | Altera | Stratix V | -(a)    | -(a)   | 2224    | -(a)
Our                | X           |         | X        | Altera | Stratix V | 933(b)  | 8(b)   | 3266(b) | 2073(b)

(a) This resource usage was not specified in the cited reference.
(b) For our design, resources used were calculated using the conversion from Altera to Xilinx resources: 1 Slice = 2 ALM, 1 BRAM = 1 M20K, 1 LUT = 1 ALUT, and 1 F/F = 1 Reg.

IV. DESIGN AND IMPLEMENTATION

The design of the SATA III core follows the layered design of SATA and the previous open-source project that it builds upon [1]. Figure 1 shows a high-level block diagram of the design. The right-hand side of the figure shows the SATA layers and how they correspond to our design. In particular, our Test Layer and Core Layer together correspond to the SATA Application Layer, while our Link Layer corresponds to the SATA Transport and Link Layers.

Fig. 1. Block diagram showing major components, buffers and FIFOs, and data flow between components. Control signals and state machines are not shown. The right-hand side shows the SATA layers corresponding to the modules in our design.

The top design unit is the Test Layer. It is used to trigger read and write tests, as well as to set options for the operations (sequential or random access). The Test Layer uses a random number generator based on an LFSR as a source of randomness for generating addresses for random accesses. The Test Layer communicates with the Core Layer, which is used to buffer data going to the disk. The Core Layer contains user_fifo, a FIFO in which each sector of data to be written is stored. The data from user_fifo is passed to the Command Layer. The Command Layer waits for one sector of data (512 bytes) to be ready in user_fifo before issuing commands to the Link Layer.
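This sector gating can be illustrated with a short Verilog sketch. The module below is a hypothetical stand-in, not the core's actual interface; only the user_fifo name comes from the design, while the count and request signals are assumptions made for the example.

module sector_gate (
    input  wire       clk,
    input  wire       rst,
    input  wire [7:0] user_fifo_count,  // Dwords currently buffered in user_fifo
    input  wire       write_request,    // write enable coming from the layers above
    output reg        issue_command     // start signal for the FIS-sequencing FSM
);
    // One 512-byte sector is 128 32-bit Dwords
    localparam [7:0] SECTOR_DWORDS = 8'd128;

    always @(posedge clk) begin
        if (rst)
            issue_command <= 1'b0;
        else
            // only hand a write command to the Link Layer once a full sector is buffered
            issue_command <= write_request && (user_fifo_count >= SECTOR_DWORDS);
    end
endmodule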
The Command Layer is responsible for generating the FIS headers and data to be sent to the Link Layer, buffering them in write_fifo. The Link Layer in turn reads the FIS headers and data from write_fifo, appends the CRC, and performs scrambling to create the FIS packet. The FIS is buffered in the transmit FIFO, tx_fifo, and read by the Physical Layer. Inside the Physical Layer, the FIS passes through an OOB submodule. OOB signaling is used during the initial phase of transmission, when the FPGA sends Out-of-Band signals to establish a link with the disk. After initialization, the OOB submodule passes the FIS to the Native PHY. The Native PHY is responsible for serializing, encoding, and sending the data over the physical SATA wires to the HD or SSD.

In the reverse direction, the Native PHY deserializes and decodes the incoming FIS packet and passes it to a receive buffer, rx_buf. The receive buffer compensates for the difference between the receive clock and the clock used by the rest of the SATA III core. The rx_buf passes the FIS to a receive FIFO, rx_fifo. The FIS from rx_fifo is descrambled and passed to a read FIFO, read_fifo. Simultaneously, the FIS's CRC is checked for any receive errors. The read_fifo is directly accessible from the Test Layer for reading out the data received from the disk.

The IP (intellectual property) modules from Altera used in this design are the Native PHY as well as the buffers and FIFOs. They are configured and generated using Altera tools.
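To illustrate the CRC that the Link Layer appends on transmit and re-computes on receive, below is a minimal Verilog sketch of a Dword-at-a-time CRC update. The generator polynomial (0x04C11DB7) and seed (0x52325032) are the values defined by the SATA specification; the module interface and the unrolled-loop formulation are assumptions for this example, not the core's actual implementation.

module crc32_dword (
    input  wire        clk,
    input  wire        rst,       // reload the seed at the start of each FIS
    input  wire        valid,     // a payload Dword is transferred this cycle
    input  wire [31:0] data_in,   // payload Dword (e.g., from write_fifo or rx_fifo)
    output reg  [31:0] crc        // running CRC: appended on transmit, compared on receive
);
    localparam [31:0] POLY = 32'h04C11DB7;  // generator polynomial from the SATA spec
    localparam [31:0] SEED = 32'h52325032;  // initial CRC value from the SATA spec

    // Fold one 32-bit Dword into the CRC, then run 32 steps of polynomial division.
    function [31:0] next_crc;
        input [31:0] c_in;
        input [31:0] d_in;
        reg   [31:0] c;
        integer      i;
        begin
            c = c_in ^ d_in;
            for (i = 0; i < 32; i = i + 1)
                c = c[31] ? ((c << 1) ^ POLY) : (c << 1);
            next_crc = c;
        end
    endfunction

    always @(posedge clk) begin
        if (rst)
            crc <= SEED;
        else if (valid)
            crc <= next_crc(crc, data_in);
    end
endmodule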
V. EVALUATION

The design was tested on the Altera DE5-Net FPGA board with the help of a Sierra M6-1 [6] SATA protocol analyzer. Figure 2 shows a diagram of the setup. The Sierra analyzer is connected on one side to the FPGA via a SATA cable, and on the other side to the HD or SSD via another SATA cable. The analyzer was used to confirm that the FPGA-generated FISes follow the SATA III standard, to detect and debug protocol errors, and to gather the measurements presented in this paper. All numbers presented are median values of multiple measurements. Due to space limitations, graphs are presented for only one solid-state drive and one hard drive from among the several tested for this project.

Fig. 2. Experimental setup with the Sierra M6-1 [6] analyzer used to interpose on FPGA-to-disk SATA traffic for debugging and gathering measurements.

A. Throughput, Excluding Drive Latency

Figures 3 and 4 show the throughput of sequential and random reads and writes for the HD and the SSD. This is the raw throughput excluding drive latency, i.e., it considers only the time for the actual transfer of the data on the wire. As a reference, the red horizontal line at the top of each graph shows the theoretical maximum throughput of SATA III, which is 600 MB/s. For sequential reads and writes, the throughput increases as the number of sectors per transaction increases up to 16. This conforms to the fact that a Data FIS can contain at most 16 sectors of data, so overhead is minimized and performance is maximized at that point. For random reads and writes, the throughput stays roughly constant, at the same level as a sequential read or write of one sector. This is consistent with the fact that each Data FIS can specify only one starting address and request size. Thus, a request that reads or writes N random sectors has to be broken down into N separate one-sector Data FISes, each with its own (random) address.

Fig. 3. SATA III core sequential and random throughput, excluding drive latency, for HD with 512-byte sectors.

Fig. 4. SATA III core sequential and random throughput, excluding drive latency, for SSD with 512-byte sectors.

B. Throughput, Including Drive Latency

Throughput excluding drive latency was used to show that the FPGA operates correctly and can nearly reach the theoretical maximum throughput. This, however, is not the performance to expect in real applications. Consequently, Figures 5 and 6 show the throughput including drive latency.

Fig. 5. SATA III core sequential and random throughput, including drive latency, for HD with 512-byte sectors.

Fig. 6. SATA III core sequential and random throughput, including drive latency, for SSD with 512-byte sectors.

Here, the throughput is calculated by dividing the data size by the total time of the request, from when the initial command is sent to the drive to when all of the data has been returned. This can be compared more directly to end-to-end throughput numbers reported for storage systems. As the number of sectors per request increases, the throughput increases. Notably, the HD write throughput is quite high, as data is first placed in the drive's buffer (and only later pushed to the slow, physically rotating disk platters). For the SSD, sequential performance is better than random access, again conforming to expectations.
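Written as a formula (the timing symbols below are introduced here for clarity and are not taken from the paper), the end-to-end throughput of a request for N 512-byte sectors is

\[
T_{\mathrm{e2e}} = \frac{N \times 512\ \mathrm{bytes}}{t_{\mathrm{last\ data\ returned}} - t_{\mathrm{command\ issued}}},
\]

whereas the figures in Section V-A count only the time the data occupies the wire.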
C. Resource Usage

Table I showed existing FPGA-based SATA implementations and the resource consumption of each, along with our design. A direct comparison is not possible, as most designs use different FPGA chips from different vendors. However, with our conversion from Altera to Xilinx resources, the complexity of the designs can be approximately compared. Our design consumes only about 11% of the Stratix V resources, even though it supports the latest SATA III protocol. The design leaves most of the FPGA chip free for other logic or for instantiating multiple SATA III cores.

D. Code Complexity

Table II shows the size of the codebase developed for this design. Altera IP code is omitted. Overall, over 2500 lines of Verilog were written for this project.

TABLE II
LINES OF CODE REQUIRED FOR EACH MODULE, COMPUTED USING THE cloc [7] PROGRAM.

Module         | Implementation | Lines of Code
Test Layer     | Verilog        | 498
Core Layer     | Verilog        | 264
Command Layer  | Verilog        | 470
Link Layer     | Verilog        | 990
Physical Layer | BDF            | -(a)
OOB            | Verilog        | 336
CRC            | Verilog        | 103
(de)scrambler  | Verilog        | 103

(a) BDF (Block Design File) in Altera Quartus is a graphical format for schematic specification of a design, thus lines of code are not applicable.

VI. CONCLUSIONS

This paper presented the design and implementation of the first open-source SATA III core for Stratix V FPGAs. Building on existing open-source code for SATA I and II, the new core is able to support the newest hard drives and solid-state drives using the SATA III protocol. Our implementation was written in Verilog and tested using an industry-standard Sierra M6-1 SATA protocol analyzer. The analyzer confirmed correct operation of the SATA III protocol in our design and was used to gather performance results showing read and write data rates of up to 600 MB/s. The final design uses only about 11% of Stratix V FPGA resources, leaving most of the chip free for other logic, so that the SATA III core can serve as a building block for other, bigger designs that integrate FPGAs and persistent storage.

A. Code Availability

The source code for the presented design will be made available at http://caslab.eng.yale.edu/code/sata.

ACKNOWLEDGMENT

We would like to thank Altera for the donation of the DE5-Net FPGA boards used in the experiments and for the Quartus software licenses. We are also thankful to Altera support for their assistance in understanding the Native PHY, as well as to Prof. Fengnian Xia and his group for the use of their high-speed oscilloscope in debugging our design.

REFERENCES

[1] C. Gorman, P. Siqueira, and R. Tessier, "An Open-Source SATA Core for Virtex-4 FPGAs," in International Conference on Field-Programmable Technology (FPT), 2013, pp. 454–457.
[2] A. A. Mendon, B. Huang, and R. Sass, "A High Performance, Open Source SATA2 Core," in International Conference on Field Programmable Logic and Applications (FPL), 2012, pp. 421–428.
[3] L. Woods and K. Eguro, "Groundhog - A Serial ATA Host Bus Adapter (HBA) for FPGAs," in International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2012, pp. 220–223.
[4] Design Gateway, "SATA IP Transport & Link Layer Core," http://www.dgway.com/products/IP/SATA-IP/dg_sata_ip_data_sheet_7series_en.pdf, accessed July 5, 2016.
[5] IntelliProp, "PC-SA101A-HI SATA Host App Core," http://intelliprop.com/cores-bridge-board-datasheets.htm, accessed July 5, 2016.
[6] Teledyne LeCroy, "Sierra M6-1 SATA Protocol Test System," http://teledynelecroy.com/protocolanalyzer/protocoloverview.aspx?seriesid=279, accessed July 5, 2016.
[7] A. Danial, "Count Lines of Code," https://github.com/AlDanial/cloc, accessed July 5, 2016.