Institutionen för systemteknik / Department of Electrical Engineering

Master's Thesis (Examensarbete)

A Selection of H.264 Encoder Components Implemented and Benchmarked on a Multi-core DSP Processor

Master's thesis in Computer Engineering at Linköping Institute of Technology
by Jonas Einemo and Magnus Lundqvist

LiTH-ISY-EX--10/4392--SE
Linköping, 15 June, 2010

Division of Computer Engineering, Department of Electrical Engineering,
Linköpings universitet, SE-581 83 Linköping, Sweden

Supervisor: Olof Kraigher, ISY, Linköpings universitet
Examiner: Dake Liu, ISY, Linköpings universitet

Language: English. Report category: Examensarbete (Master's thesis).

URL for electronic version:
http://www.da.isy.liu.se/en/index.html
http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-4292

Abstract

H.264 is a video coding standard which offers a high data compression rate at the cost of a high computational load. This thesis evaluates how well parts of the H.264 standard can be implemented for a new multi-core digital signal processing processor architecture called ePUMA, and investigates whether real-time encoding of high-definition video sequences could be performed.
The implementation consists of the motion estimation, motion compensation, discrete cosine transform, inverse discrete cosine transform, quantization and rescaling parts of the H.264 standard. Benchmarking is done using the ePUMA system simulator, and the results are compared to an implementation of an existing H.264 encoder for another multi-core processor architecture called STI Cell. The results show that the selected parts of the H.264 encoder could be run on 6 calculation cores in 5 million cycles per frame. This setup leaves 2 calculation cores to run the remaining parts of the encoder.

Keywords: ePUMA, DSP, SIMD, H.264, Parallel Programming, Motion Estimation, DCT

Acknowledgments

We would like to thank everyone who has helped us during our thesis work, especially our supervisor Olof Kraigher for all the help and useful hints, and our examiner Professor Dake Liu for his support, comments and the opportunity to do this thesis.
We would also like to thank Jian Wang for the support on the DMA firmware, Jens Ogniewski for the help with understanding the H.264 standard, and our families and friends for their support and for bearing with us during the work on this thesis.

Jonas Einemo and Magnus Lundqvist
Linköping, June 2010

Contents

1 Introduction
  1.1 Background
  1.2 Purpose
  1.3 Scope
  1.4 Way of Work
  1.5 Outline

2 Overview of Video Coding
  2.1 Introduction to Video Coding
  2.2 Color Spaces
  2.3 Predictive Coding
  2.4 Transform Coding and Quantization
  2.5 Entropy Coding
  2.6 Quality Measurements
    2.6.1 Subjective Quality
    2.6.2 Objective Quality

3 Overview of H.264
  3.1 Introduction to H.264
  3.2 Coded Slices
    3.2.1 I Slice
    3.2.2 P Slice
    3.2.3 B Slice
    3.2.4 SP Slice
    3.2.5 SI Slice
  3.3 Intra Prediction
  3.4 Inter Prediction
    3.4.1 Hexagon search
  3.5 Transform Coding and Quantization
    3.5.1 Discrete Cosine Transform
    3.5.2 Inverse Discrete Cosine Transform
    3.5.3 Quantization
    3.5.4 Rescaling
  3.6 Deblocking filter
  3.7 Entropy coding

4 Overview of the ePUMA Architecture
  4.1 Introduction to ePUMA
  4.2 ePUMA Memory Hierarchy
  4.3 Master Core
    4.3.1 Master Memory Architecture
    4.3.2 Master Instruction Set
    4.3.3 Datapath
  4.4 Sleipnir Core
    4.4.1 Sleipnir Memory Architecture
    4.4.2 Datapath
    4.4.3 Sleipnir Instruction Set
    4.4.4 Complex Instructions
  4.5 DMA Controller
  4.6 Simulator

5 Elaboration of Objectives
  5.1 Task Specification
    5.1.1 Questions at Issue
  5.2 Method
  5.3 Procedure
6 Implementation
  6.1 Motion Estimation
    6.1.1 Motion Estimation Reference
    6.1.2 Complex Instructions
    6.1.3 Sleipnir Blocks
    6.1.4 Master Code
  6.2 Discrete Cosine Transform and Quantization
    6.2.1 Forward DCT and Quantization
    6.2.2 Rescaling and Inverse DCT

7 Results and Analysis
  7.1 Motion Estimation
    7.1.1 Kernel 1
    7.1.2 Kernel 2
    7.1.3 Kernel 3
    7.1.4 Kernel 4
    7.1.5 Kernel 5
    7.1.6 Master Code
    7.1.7 Summary
  7.2 Transform and Quantization

8 Discussion
  8.1 DMA
  8.2 Main Memory
  8.3 Program Memory
  8.4 Constant Memory
  8.5 Vector Register File
  8.6 Register Forwarding
  8.7 New Instructions
    8.7.1 SAD Calculations
    8.7.2 Call and Return
  8.8 Master and Sleipnir Core
  8.9 ePUMA H.264 Encoding Performance
  8.10 ePUMA Advantages
  8.11 Observations
9 Conclusions and Future Work
  9.1 Conclusions
  9.2 Future Work

Bibliography

A Proposed Instructions

B Results

List of Figures

2.1 Overview of the data flow in a basic encoder and a decoder
2.2 YUV 4:2:0 sampling format
3.1 Overview of the data flow in an H.264 encoder
3.2 4x4 luma prediction modes
3.3 16x16 luma prediction modes
3.4 Different ways to split a macroblock in inter prediction
3.5 Subsamples interpolated from neighboring pixels
3.6 Multiple frame prediction
3.7 Large (a) and small (b) search pattern in the hexagon search algorithm
3.8 Movement of the hexagon pattern in a search area and the change to the smaller search pattern
3.9 DCT functional schematic
3.10 IDCT functional schematic
3.11 Filtering order of a 16x16 pixel macroblock with start in A and end in H for luminance (a) and start in 1 and end in 4 for chrominance (b)
3.12 Pixels in blocks adjacent to vertical and horizontal boundaries
4.1 ePUMA memory hierarchy
4.2 ePUMA star network interconnection
4.3 Senior datapath for short instructions
4.4 Sleipnir datapath pipeline schematic
4.5 Sleipnir Local Store switch
6.1 Motion estimation program flowchart
6.2 Motion estimation computational flowchart
6.3 Hexagon search program flow controller
6.4 Proposed implementation of call and return hardware
6.5 Reference macroblock overlap
6.6 Reference macroblock partitioning for 13 data macroblocks
6.7 Master program flowchart
6.8 Memory allocation of data memory in the master (a) and main memory allocation (b)
6.9 Sleipnir core motion estimation task partitioning and synchronization
6.10 DCT flowchart
6.11 Memory transpose schematic
7.1 Cycle scaling from 1 to 8 Sleipnir cores for simulation of Riverbed
7.2 Frame 10 from Pedestrian Area video sequence
7.3 Difference between frame 10 and frame 11 in Pedestrian Area video sequence
7.4 Motion vector field calculated by kernel 5 on frame 10 and 11 of the Pedestrian Area video sequence
7.5 Difference between frame 10 and frame 11 in Pedestrian Area video sequence using motion compensation
8.1 Sleipnir core DCT task partitioning and synchronization
8.2 Memory allocation of macroblock in LVM for intra coding
A.1 HVBSUMABSDWA
A.2 HVBSUMABSDNA
A.3 HVBSUBWA
A.4 HVBSUBNA

List of Tables

3.1 Qstep for a few different values of QP
3.2 Multiplication factor MF
3.3 Scaling factor V
4.1 Pipeline specification
4.2 Register file access types
4.3 Address register increment operations
4.4 Addressing modes examples
7.1 Short names for kernels that have been tested
7.2 Description of table columns
7.3 Motion estimation results from simulation on Riverbed frame 10 and Riverbed frame 11 with kernel 1 using 1 Sleipnir core
7.4 Motion estimation results from simulation on Riverbed frame 10 and Riverbed frame 11 with kernel 1 using 8 Sleipnir cores
7.5 Block 1 costs
7.6 Motion estimation results from simulation on Riverbed frame 10 and Riverbed frame 11 with kernel 2 using 1 Sleipnir core
7.7 Motion estimation results from simulation on Riverbed frame 10 and Riverbed frame 11 with kernel 2 using 8 Sleipnir cores
7.8 Block 2 costs
7.9 Motion estimation results from simulation on Riverbed frame 10 and Riverbed frame 11 with kernel 3 using 8 Sleipnir cores
7.10 Kernel 3 costs
7.11 Motion estimation results from simulation on Riverbed frame 10 and Riverbed frame 11 with kernel 4 using 4 Sleipnir cores
7.12 Motion estimation results from simulation on Riverbed frame 10 and Riverbed frame 11 with kernel 4 using 8 Sleipnir cores
7.13 Kernel 4 costs
7.14 Motion estimation results from simulation on Sunflower frame 10 and Sunflower frame 11 with kernel 5 using 8 Sleipnir cores
7.15 Motion estimation results from simulation on Blue sky frame 10 and Blue sky frame 11 with kernel 5 using 8 Sleipnir cores
7.16 Motion estimation results from simulation on Pedestrian area frame 10 and Pedestrian area frame 11 with kernel 5 using 8 Sleipnir cores
7.17 Motion estimation results from simulation on Riverbed frame 10 and Riverbed frame 11 with kernel 5 using 4 Sleipnir cores
7.18 Motion estimation results from simulation on Riverbed frame 10 and Riverbed frame 11 with kernel 5 using 8 Sleipnir cores
7.19 Kernel 5 costs
7.20 Master code cost
7.21 Prolog and epilog cycle costs
7.22 Simulated epilog cycle cost including waiting for last Sleipnir to finish
7.23 DMA cycle costs
7.24 Costs for DCT with quantization block and IDCT with rescaling block
B.1 Simulation cycle cost of motion estimation kernels
Abbreviations

AGU: Address Generation Unit
ALU: Arithmetic Logic Unit
AVC: Advanced Video Coding
CABAC: Context-based Adaptive Binary Arithmetic Coding
CAVLC: Context-based Adaptive Variable Length Coding
CB: Copy Back
CM: Constant Memory
CODEC: COder/DECoder
DCT: Discrete Cosine Transform
DMA: Direct Memory Access
DSP: Digital Signal Processing
ePUMA: Embedded Parallel Digital Signal Processing Processor Architecture with Unique Memory Access
FIR: Finite Impulse Response
FPS: Frames Per Second
FS: Full Search
HDTV: High-Definition Television
HVBSUBNA: Half Vector Bytewise SUBtraction Not word Aligned
HVBSUBWA: Half Vector Bytewise SUBtraction Word Aligned
HVBSUMABSDNA: Half Vector Bytewise SUM of ABSolute Differences Not word Aligned
HVBSUMABSDWA: Half Vector Bytewise SUM of ABSolute Differences Word Aligned
IDCT: Inverse Discrete Cosine Transform
IEC: International Electrotechnical Commission
ISO: International Organization for Standardization
ITU: International Telecommunications Union
LS: Local Storage
LVM: Local Vector Memory
MAE: Mean Absolute Error
MB: Macroblock
MC: Motion Compensation
ME: Motion Estimation
MF: Multiplication Factor
MPEG: Moving Picture Experts Group
MSE: Mean Square Error
NAL: Network Abstraction Layer
NoC: Network on Chip
PM: Program Memory
PSNR: Peak Signal to Noise Ratio
QP: Quantization Parameter
RAM: Random Access Memory
RGB: Red, Green and Blue, a color space
ROM: Read Only Memory
SAD: Sum of Absolute Differences
SPRF: SPecial Register File
STI: Sony Toshiba IBM
V: Rescaling Factor
VCEG: Video Coding Experts Group
VRF: Vector Register File
YUV: A color space

Chapter 1

Introduction

This chapter gives a background to the thesis, defines the purpose, scope and way of work, and presents the outline of the thesis.

1.1 Background

With new handheld devices and mobile systems offering ever more advanced services, the need for increased computational power at low cost, both in terms of chip area and power dissipation, is ever increasing.
Now that video playback and recording are standard applications rather than premium features in mobile devices, high computational power at a low cost is still a problem without a sufficient solution.

The Division of Computer Engineering at the Department of Electrical Engineering at Linköpings Tekniska Högskola has for some time been part of a research project called ePUMA, which reads out as Embedded Parallel Digital Signal Processing Processor Architecture with Unique Memory Access. The development is driven by the demands of the next generation of digital signal processing. By developing a cheap, low-power processor with large calculation power, this new architecture aims to meet tomorrow's demands in digital signal processing. The main applications for the processor are future radio base stations, radar and High-Definition Television (HDTV).

H.264 is a standard for video compression that saw daylight back in 2003. It is now a mature and widely spread standard that is used in Blu-ray, popular video streaming websites like YouTube, television services and video conferencing. It provides very good compression at the cost of high computational complexity. The hope is that the ePUMA multi-core architecture will be able to handle real-time video encoding using the H.264 standard.

At the Division of Computer Engineering, previous work has been done on implementing an H.264 encoder for another multi-core architecture. This work was done on the STI Cell, which is used in e.g. the popular video gaming console PlayStation 3.

1.2 Purpose

The purpose of this master's thesis is to evaluate the capability of the ePUMA processor architecture with respect to real-time video encoding using the H.264 video compression standard, and to find and expose possible areas of improvement in the ePUMA architecture. This will be done by implementing parts of an H.264 encoder and, if possible, comparing the cycle counts to those of the previously implemented STI Cell H.264 encoder.
1.3 Scope

By implementing the most computationally expensive parts of the H.264 standard it is possible to better estimate whether the ePUMA processor architecture is capable of encoding video using the H.264 standard in real time. A study of the H.264 standard shows that entropy coding is the most time consuming part if it is done in software. Because of the large amount of bit manipulation needed, it is not feasible to perform entropy coding in the processor. Therefore an early decision was made that entropy coding had to be hardware accelerated and that it should not be a part of this thesis. In this thesis no exact hardware costs for performance improvements will be calculated; instead, their feasibility will be reasoned about. The time constraint of this master's thesis is twenty weeks, which restricts the extent of the work. Because of the time constraint, some parts of a complete encoder have had to be left out.

1.4 Way of Work

One of the most time consuming tasks is motion estimation, which together with the discrete cosine transform and quantization became the primary target for evaluation. First a working implementation was produced. An iterative development process was then used to refine the implementations and reach better performance. The partial implementations of the H.264 standard were written for the ePUMA system simulator. The simulator was also used for all performance measurements of the implementations, using frames from several commonly used test video sequences. Once the performance measurement results were acquired, they were analyzed and conclusions were drawn. The way of work is elaborated in section 5.2 and section 5.3.

1.5 Outline

This thesis is aimed at an audience with an education in electrical engineering, computer engineering or similar. Expertise in video coding or the H.264 standard is not necessary, as the main principles of these topics will be covered.
The outline of this thesis is ordered as naturally as possible: this introduction chapter is followed by theoretical chapters containing the topics needed to understand the rest of the thesis. The first of these is chapter 2, which covers the basics of video coding, followed by chapter 3, which offers an introduction to the H.264 video coding standard. The last theoretical chapter is chapter 4, which covers the hardware architecture and toolchain of the ePUMA processor. The theory is followed by chapter 5, where a more detailed task specification, method and procedure of the thesis are presented with help from the knowledge obtained in the theoretical chapters. After that, chapter 6 describes the function and development of the implementations produced. Chapter 7 then presents the results obtained and gives an analysis of them. Chapter 8 contains a discussion of the results as well as ideas that came up while working on this thesis. The final chapter is chapter 9, which contains the conclusions and the future work that could be done in the area.

Chapter 2

Overview of Video Coding

This chapter gives an introduction to video coding, color spaces, predictive coding, transform coding and entropy coding. This knowledge is necessary to understand the rest of the thesis.

2.1 Introduction to Video Coding

A video consists of several images, called frames, shown in a sequence. The amount of disk space required to store a sequence of raw frames is huge, and therefore video coding is needed. The purpose of video coding is to minimize the data to store on disk or to send over a network, without decreasing the image quality too much. There are many techniques and algorithms on the market to do this, such as MPEG-2, MPEG-4 and H.264/AVC.
[10]

Figure 2.1: Overview of the data flow in a basic encoder and a decoder

All of these algorithms are constructed from a similar template. First some technique is used to reduce the amount of data to be transformed. The video is then transformed with, for example, a Discrete Cosine Transform (DCT). After this a quantization is performed to shrink the data further. The data is then pushed through an entropy coder such as Huffman, or a more advanced algorithm such as Context-based Adaptive Binary Arithmetic Coding (CABAC) or Context-based Adaptive Variable Length Coding (CAVLC), all of which compress the data based on patterns in the bit-stream. [10] The data flow of a basic encoder and a basic decoder is illustrated in figure 2.1.

As mentioned, a video sequence consists of many frames. In video coding these frames can be divided into so-called slices. A slice can be a part of a frame or contain the complete frame. This slice division is advantageous because it makes it possible to know, e.g., that data in a slice does not depend on data outside the slice. The frames are also divided into macroblocks. A macroblock is a block consisting of 16×16 pixels. This partitioning of the data makes computations easier to organize and structure. [10]

2.2 Color Spaces

To understand video coding, some knowledge about different color spaces is needed. One of these color spaces is RGB, whose name comes from its components red, green and blue. With these three colors at different intensities it is possible to visualize all colors in the spectrum. Another commonly used color space is YCbCr, also called YUV. In this color space Y represents the luminance (luma) component, which corresponds to the brightness of a specific pixel.
The other two components, Cb and Cr, are chrominance (chroma) components which carry the color information. [10] The conversion from the RGB color space to the YUV color space is shown in equation (2.1):

Y   = k_r R + k_g G + k_b B
C_b = B - Y
C_r = R - Y                      (2.1)
C_g = G - Y

As seen in equation (2.1), there also exists a third chrominance component for green, C_g, which thanks to equation (2.2) can be calculated as shown in equation (2.3). This means that C_g can be calculated by the decoder and does not have to be transmitted, which is advantageous in the sense of data compression. [10]

k_b + k_r + k_g = 1              (2.2)

C_g = Y - C_b - C_r              (2.3)

The human eye is more sensitive to luminance than to chrominance, and because of that a smaller number of bits can be used to represent chrominance and a larger number for luminance. With this feature of the YUV color space the total number of bits needed to encode a pixel can be reduced. A common way to do this is by applying the 4:2:0 sampling format.

Figure 2.2: YUV 4:2:0 sampling format

The 4:2:0 sampling format can be described as a '12 bits per pixel' format where there are 2 samples of chrominance for every 4 samples of luminance, as shown in figure 2.2. If each sample is stored using 8 bits, this adds up to 6 * 8 = 48 bits for 4 YUV 4:2:0 pixels, an average of 48/4 = 12 bits per pixel. [10]

2.3 Predictive Coding

There are two kinds of predictive coding: intra coding and inter coding. By studying a picture it is easy to see that some parts of the picture are very similar; this is called spatial correlation. The predictive coding that uses these spatial correlations within a frame to form a prediction of other parts of the frame is called intra coding. By studying a sequence of pictures or a video sequence it can be seen that there is usually not much difference between the frames; this is called temporal correlation.
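This temporal correlation is what inter coding exploits: subtracting two similar frames leaves values clustered around zero, which are cheaper to code. A minimal sketch (the tiny 4x4 "frames" are invented purely for illustration):

```python
# Two consecutive toy "frames" that differ only slightly,
# mimicking the temporal correlation found in real video.
frame_prev = [[100, 101, 102, 103],
              [100, 101, 102, 103],
              [ 50,  51,  52,  53],
              [ 50,  51,  52,  53]]
frame_curr = [[101, 102, 103, 104],
              [100, 101, 102, 103],
              [ 50,  51,  52,  53],
              [ 51,  52,  53,  54]]

# The difference contains much smaller values than the frames
# themselves, so it can be represented with fewer bits.
residue = [[c - p for c, p in zip(rc, rp)]
           for rc, rp in zip(frame_curr, frame_prev)]

print(max(abs(v) for row in residue for v in row))  # -> 1
```

In a real encoder the subtraction is done per macroblock against a motion-compensated prediction rather than against the co-located pixels, but the principle is the same.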
By exploiting this temporal correlation a difference, also called a residue, can be calculated. The residue is comprised of smaller values and can therefore be described with a smaller number of bits, resulting in better data compression. The predictive coding that uses temporal correlations between different frames is called inter coding. [10]

2.4 Transform Coding and Quantization

The purpose of transform coding is to convert the image data or motion compensated data into another representation. This can be done with a number of different algorithms, of which the block-based Discrete Cosine Transform (DCT) is one of the most common in video coding. The DCT algorithm converts the data into sums of cosine functions oscillating at different frequencies. [10]

There are several different transforms that could be used in video coding, but their common property is that they are reversible, meaning the transform can be inverted without loss of data. This is an important property because otherwise drift between the encoder and decoder can occur, and special algorithms would have to be applied to correct these errors. As mentioned before, block-based transform coding is the most common. When using block-based transform coding, the picture is divided into smaller blocks such as 8×8 or 4×4 pixels. Each block is then transformed with the chosen transform. The transformed data is then quantized to remove high-frequency data. This can be done because the human eye is insensitive to higher frequencies, so these can be removed without any noticeable loss of quality. The quantizer re-maps the input data with one range of values to output data with a smaller range of possible values. This means the output can be coded with fewer bits than the original data, and in this way data compression is achieved. [10]

2.5 Entropy Coding

Entropy coding is a lossless data compression technique.
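Before moving on, the range re-mapping performed by the quantizer in section 2.4 can be sketched as a simple uniform quantizer; the step size of 8 and the coefficient values are arbitrary illustrative choices, not values from the H.264 standard:

```python
def quantize(coeff, step=8):
    """Map a coefficient to a smaller range of integer levels."""
    return round(coeff / step)

def rescale(level, step=8):
    """Approximately invert the quantizer (information is lost)."""
    return level * step

coeffs = [131, -42, 7, 3, -1, 0]
levels = [quantize(c) for c in coeffs]
print(levels)                      # -> [16, -5, 1, 0, 0, 0]
print([rescale(l) for l in levels])  # -> [128, -40, 8, 0, 0, 0]
```

Note how the small coefficients (typically the high-frequency ones after a DCT) map to zero and cannot be recovered by rescaling; this is exactly where the lossy part of the compression happens.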
Entropy coding algorithms encode symbols that occur often with few bits and symbols that occur less often with more bits. The bits are all put in a bitstream that can be written to disk or sent over a network. In video coding these symbols can be quantized transform coefficients, motion vectors, headers or other information needed to decode the video stream. As mentioned earlier, a few of the usual entropy coding algorithms are Huffman, CABAC and CAVLC. [10]

2.6 Quality Measurements

There exist several ways to measure the quality of images and to compare uncompressed images with reconstructed ones in order to evaluate video coding algorithms.

2.6.1 Subjective Quality

Subjective quality is the quality that someone watching an image or a video sequence experiences. It can be measured by having evaluators rate each part of a series of images or video sequences with different properties. This can be a time consuming and impractical way of measurement in most circumstances. [10]

2.6.2 Objective Quality

To enable more automatic measurement of quality, some algorithms are commonly used. One of these is Peak Signal to Noise Ratio (PSNR), which can be used to measure the quality of a reconstructed image by comparing it to an uncompressed one. PSNR gives a logarithmic scale where a higher value is better. The Mean Square Error (MSE) is used in the calculation of PSNR and is calculated as

MSE = \frac{1}{m n} \sum_{i=1}^{m} \sum_{j=1}^{n} (C(i, j) - R(i, j))^2        (2.4)

where n is the image height, m is the image width, and C and R are the current and reference images being compared. With the MSE value the PSNR can be calculated as

PSNR = 10 \log_{10} \frac{(2^{bits} - 1)^2}{MSE}        (2.5)

where 2^{bits} - 1 is the largest representable value of a pixel with the specified number of bits. [10]

Chapter 3

Overview of H.264

This chapter presents an overview of the H.264 video compression standard.
Some sections are more detailed than others because of their relevance to this thesis. The topics covered include the different frame and slice types, intra and inter prediction, transform coding, quantization, the deblocking filter and finally entropy coding.

3.1 Introduction to H.264

H.264 [12], also known as Advanced Video Coding (AVC) and MPEG-4 Part 10, is a standard for video compression. The standard has been developed by the Video Coding Experts Group (VCEG) of the International Telecommunication Union (ITU) together with the Moving Picture Experts Group (MPEG), which is a working group of the International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC). The main objective when H.264 was developed was to maximize the efficiency of the video compression, but also to provide a standard with high transmission efficiency which supports reliable and robust transmission of data over different channels and networks. [10]

H.264 is divided into a number of different profiles, which include different parts of the video coding features from the H.264 standard. Some of the most common ones are the Extended, Baseline, Constrained Baseline and Main profiles. The Baseline profile supports inter and intra coding and entropy coding with CAVLC. The Main profile supports interlaced video, inter coding using B-slices and entropy coding using CABAC. The Extended profile does not support interlaced video or CABAC, but supports switching slices and has improved error resilience. [10]

In figure 3.1 a detailed view of the data flow in an H.264 encoder can be seen. This figure illustrates the important prediction coding and how it is connected to the other parts of the encoder. The in-loop deblocking filter can also be seen in this illustration.
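As an illustrative sketch (not the thesis implementation), the feedback structure of figure 3.1 can be condensed to a few lines of Python, with the transform omitted and a plain step-size quantizer standing in for the quantization and rescaling blocks:

```python
# Minimal sketch of the encoder reconstruction loop in figure 3.1.
# Blocks are flat lists of pixel values; qstep is a simple quantizer step.

def quantize(residual, qstep):
    return [round(r / qstep) for r in residual]

def rescale(z, qstep):
    return [v * qstep for v in z]

def encode_block(current, prediction, qstep):
    residual = [c - p for c, p in zip(current, prediction)]
    z = quantize(residual, qstep)         # coefficients sent to entropy coding
    reconstructed = [p + r for p, r in zip(prediction, rescale(z, qstep))]
    return z, reconstructed               # reconstruction feeds later predictions
```

The point of the loop is that the encoder reconstructs from the quantized data, exactly as the decoder will, so that the predictions on both sides stay in sync.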
[10]

Figure 3.1: Overview of the data flow in an H.264 encoder (current frame Fn, motion estimation and compensation, DCT, quantization, reordering and entropy coding to NAL, with rescaling, IDCT, deblocking filter and the reconstructed reference frame F'n-1 in the feedback loop)

3.2 Coded Slices

A frame can be divided into smaller parts called slices, which can then be coded in different modes. The different coding modes in H.264 are presented below. [14]

3.2.1 I Slice

In the I slice all macroblocks are intra coded. The encoder uses the spatial correlations within a single slice to code that slice. The encoded I slice requires the most space of all the different slice types. [10]

3.2.2 P Slice

P slices can contain both I coded macroblocks and P coded macroblocks. P coded macroblocks are predicted from a list of reference macroblocks. [10]

3.2.3 B Slice

B slices, or bidirectional slices, can contain both B coded macroblocks and I coded macroblocks. B coded macroblocks can be predicted from two different lists of reference macroblocks, both before and after the current frame in time. [10]

3.2.4 SP Slice

A Switching P (SP) slice is coded in a way that supports easy switching between similar precoded video streams without suffering the high penalty of sending a new I slice. [10]

3.2.5 SI Slice

A Switching I (SI) slice is an intra coded slice and supports easy switching between two different streams that do not correlate. [10]

3.3 Intra Prediction

In intra coding the encoder only uses data from the current frame. Intra prediction is the next step in this direction, trying to minimize the coded frame size.
With intra prediction the encoder tries to utilize the spatial correlation within the frame. [10]

Figure 3.2: 4x4 luma prediction modes (0 vertical, 1 horizontal, 2 DC, 3 diagonal down-left, 4 diagonal down-right, 5 vertical-right, 6 horizontal-down, 7 vertical-left, 8 horizontal-up)

Figure 3.3: 16x16 luma prediction modes (0 vertical, 1 horizontal, 2 DC, 3 plane)

H.264 supports 9 different intra prediction modes for 4x4 sample luma blocks, four different modes for 16x16 sample luma blocks and four modes for 8x8 chroma components. The 9 4x4 prediction modes are illustrated in figure 3.2 and the 4 16x16 luma prediction modes are illustrated in figure 3.3. The pixels are interpolated or extrapolated from the nearby pixels, i.e. the labeled pixels in the figures. Usually the encoder selects the prediction mode that minimizes the difference between the predicted block and the block to be encoded. I_PCM is another prediction mode which makes it possible to transmit samples of an image without prediction or transformation. [10, 14]

3.4 Inter Prediction

Inter prediction creates a prediction model from one or more previously encoded video frames or slices using block-based motion compensation. The motion vector precision can be up to a quarter pixel resolution. The task is to find a vector that points to a block of pixels with the smallest difference between the reference block and the block in the frame that is being encoded. [10]

Figure 3.4: Different ways to split a macroblock in inter prediction (16x16, 16x8, 8x16, 8x8, 8x4, 4x8 and 4x4)

H.264 supports a range of block sizes from 16x16 to 4x4 pixels.
This is illustrated in figure 3.4. Using large blocks saves data because fewer motion vectors are needed, but the distortion can be very high when there are many small things moving around in the video sequence. Using smaller blocks will in many cases lower the distortion, but will instead increase the number of bits needed to store the increased number of motion vectors. By letting the encoder find the best trade-off, good data compression of the video sequence can be achieved. The blocks are split when a threshold value is reached. [10]

    SAD = sum_{i=1..m} sum_{j=1..n} |C(i, j) - R(i, j)|    (3.1)

    MSE = (1 / (m * n)) * sum_{i=1..m} sum_{j=1..n} (C(i, j) - R(i, j))^2    (3.2)

    MAE = (1 / (m * n)) * sum_{i=1..m} sum_{j=1..n} |C(i, j) - R(i, j)|    (3.3)

The macroblock cost is commonly calculated in one of a few different ways, of which Sum of Absolute Difference (SAD) is the most common since it has the lowest computational complexity. The definition of SAD can be found in equation (3.1). Two other common ways to calculate the cost are Mean Square Error (MSE) and Mean Absolute Error (MAE), presented in equation (3.2) and equation (3.3) respectively. In equations (3.1) to (3.3), n is the image width and m is the image height. [10]

Figure 3.5: Subsamples interpolated from neighboring pixels

More accurate motion estimation in the form of sub pixel motion vectors is available in H.264. Up to quarter pixel resolution is supported for the luma component and one eighth sample resolution for the chroma components. This motion estimation is made possible by interpolating neighboring pixels and then comparing with the current frame in the encoder. The interpolation is performed by a 6 tap Finite Impulse Response (FIR) filter with weights (1/32, -5/32, 20/32, 20/32, -5/32, 1/32). [10] In figure 3.5 the half pixel sample b can be located. To generate this sample equation (3.4) can be used.
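The 6-tap half-pel filtering and the quarter-pel averaging described above can be sketched as follows (illustrative Python; clipping of the result to the valid pixel range, which H.264 also performs, is omitted):

```python
# Half-pel interpolation with the H.264 6-tap FIR filter
# (1, -5, 20, 20, -5, 1)/32, and quarter-pel linear interpolation.

def half_pel(p):
    # p: six full-pel neighbours in filter order, e.g. (E, F, G, H, I, J)
    taps = (1, -5, 20, 20, -5, 1)
    return round(sum(t * v for t, v in zip(taps, p)) / 32)

def quarter_pel(s0, s1):
    # linear interpolation between two already-known samples
    return round((s0 + s1) / 2)
```

Note that a flat area is preserved exactly, since the filter taps sum to 32.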
Sample m can be calculated in a similar way, shown in equation (3.5). [10]

    b = round((E - 5F + 20G + 20H - 5I + J) / 32)    (3.4)

    m = round((B - 5D + 20H + 20N - 5S + U) / 32)    (3.5)

After generating all half pixel samples from real samples, there are some half pixel samples that have not yet been generated. These have to be generated from already generated samples; the sample j in figure 3.5 is an example of that. To generate j the same FIR filter is used, but with samples 1, 2, b, s, 7 and 8. j could also be generated with samples 3, 4, h, m, 5 and 6. Note that unrounded versions of the samples should be used when calculating j. When all half pixel samples are generated, the quarter pixel samples can be generated. This is done by linear interpolation. Sample a in figure 3.5 is calculated as in equation (3.6) and sample d is calculated as in equation (3.7). To generate the last samples, two diagonal half pixel samples are used, see equation (3.8). [10]

    a = round((G + b) / 2)    (3.6)

    d = round((G + h) / 2)    (3.7)

    e = round((h + b) / 2)    (3.8)

To enhance the video compression even more, H.264 supports predicting macroblocks from more than one frame. This can be applied to both B and P coded slices, and with it a much better video compression can be achieved. The downside of multiframe prediction is an increased cost in memory size, memory bandwidth and computational complexity. [10]

Figure 3.6: Multiple frame prediction

To find the best motion vector the encoder uses a search algorithm such as Full Search (FS), Diamond Search or Hexagon Search. With Full Search a complete search of the whole search area is performed. This algorithm provides the best compression efficiency but is also the most time consuming. Diamond Search is a less time consuming search algorithm where the search pattern is formed as a diamond.
Its performance in terms of compression is good in comparison with FS. Hexagon Search is an even more refined search pattern where the search points are formed as a hexagon, see figure 3.7a. By decreasing the number of search points the effort to calculate the motion vector is minimized, and the result will be almost as good as with Diamond Search [16]. Motion estimation is the part of H.264 encoding that consumes the most computational power, predicted to consume about 60% to 80% of the total encoding time [15].

3.4.1 Hexagon Search

Hexagon Search uses a 7 point search pattern, which can be seen in figure 3.7a. Each cross in the grid represents a search point in the search area, where the grid resolution is one pixel. For each search point a Sum of Absolute Difference, equation (3.1), is calculated. [16]

Figure 3.7: Large (a) and small (b) search pattern in the hexagon search algorithm.

The steps of the hexagon search are the following:

1. Calculate the SAD of the six closest search points and the current search point.
2. Make the search point with the smallest SAD the new current search point. If the middle point has the smallest SAD, jump to step 5.
3. Calculate the SAD of the 3 new search points that have not yet been calculated, as illustrated in figure 3.8.
4. Jump to step 2.
5. Calculate the SAD of the 4 new search points forming a diamond around the middle point. This is illustrated in figure 3.7b.
6. Choose the search point that resulted in the smallest SAD and form a motion vector to this search point.

When the smallest SAD is found the motion compensated residue can be calculated. This residue is then sent to the transformation part of the encoder for further processing. In the decoder the motion vectors are used to restore the image correctly from the residue that was sent from the encoder.
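As a sketch, the steps above can be written in Python as follows. This is illustrative code, not the thesis kernel: it assumes 4x4 blocks, and for clarity it recomputes all six hexagon neighbours in every iteration instead of only the three new points of step 3:

```python
# Hexagon-pattern motion search over 2-D pixel arrays (lists of rows).
# LARGE is the 6-point hexagon, SMALL the final 4-point diamond.

LARGE = [(-2, 0), (-1, -2), (1, -2), (2, 0), (1, 2), (-1, 2)]
SMALL = [(-1, 0), (0, -1), (1, 0), (0, 1)]

def sad(cur, ref, bx, by, dx, dy, n=4):
    # Sum of Absolute Difference of equation (3.1) for an n x n block.
    return sum(abs(cur[by + j][bx + i] - ref[by + j + dy][bx + i + dx])
               for j in range(n) for i in range(n))

def hexagon_search(cur, ref, bx, by, bound):
    mv = (0, 0)
    best = sad(cur, ref, bx, by, *mv)
    while True:                                  # large-pattern phase
        cands = [(mv[0] + dx, mv[1] + dy) for dx, dy in LARGE
                 if abs(mv[0] + dx) <= bound and abs(mv[1] + dy) <= bound]
        c_best, c_mv = min((sad(cur, ref, bx, by, *c), c) for c in cands)
        if c_best >= best:
            break                                # centre wins: refine with SMALL
        best, mv = c_best, c_mv
    for dx, dy in SMALL:                         # small-pattern refinement
        c = (mv[0] + dx, mv[1] + dy)
        if abs(c[0]) <= bound and abs(c[1]) <= bound:
            s = sad(cur, ref, bx, by, *c)
            if s < best:
                best, mv = s, c
    return mv, best
```

On a smooth image shifted by a known amount, the search walks the hexagon toward the minimum and the final diamond picks up the exact motion vector.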
[16]

Figure 3.8: Movement of the hexagon pattern in a search area and the change to the smaller search pattern.

3.5 Transform Coding and Quantization

The main transform used in H.264 is the discrete cosine transform.

3.5.1 Discrete Cosine Transform

The Discrete Cosine Transform (DCT) is a widely used transform in image and video compression algorithms. In H.264 the DCT decorrelates the residual data before quantization takes place. The DCT is a block based algorithm, which means it transforms one block at a time. In standards prior to H.264 the blocks were 8x8 pixels large, but that has been changed to 4x4 samples to reduce the blocking effects which lower the visual quality of the video. The DCT used in H.264 is a modified two-dimensional (2D) DCT transform. The transform matrix for the modified 2D DCT can be found in equation (3.9). [10]

    Cf = [ 1  1  1  1
           2  1 -1 -2
           1 -1 -1  1
           1 -2  2 -1 ]    (3.9)

The 2D DCT transform in H.264 is given by

    Y = Cf X Cf^T (x) Ef    (3.10)

where (x) denotes element-wise multiplication and Ef is the post-scaling matrix

    Ef = [ a^2    ab/2   a^2    ab/2
           ab/2   b^2/4  ab/2   b^2/4
           a^2    ab/2   a^2    ab/2
           ab/2   b^2/4  ab/2   b^2/4 ]

with

    a = 1/2    (3.11)

    b = sqrt(2/5)    (3.12)

and X is the 4x4 block of pixels to calculate the DCT of. To simplify computation somewhat, the post-scaling ((x) Ef) can be absorbed into the quantization process [10]; this is described in more detail in section 3.5.3, which covers the quantization. The modified 2D DCT is an approximation of the standard DCT. It does not give the same result, but the compression is almost identical. The advantage of this approximation is that the core equation Cf X Cf^T can be computed in 16 bit arithmetic with only shifts, additions and subtractions [6]. To perform a two-dimensional DCT, two one-dimensional DCTs can be performed after each other, the first one on rows and the second one on columns, or vice versa.
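This row-column evaluation of the core transform W = Cf X Cf^T can be sketched as follows (illustrative Python, not ePUMA code), using only additions, subtractions and shifts:

```python
# 1-D core of the H.264 4x4 integer transform, matching the rows of Cf,
# followed by row-column evaluation of the 2-D transform.

def dct1d(x):
    s0, s1 = x[0] + x[3], x[1] + x[2]        # butterfly sums
    d0, d1 = x[0] - x[3], x[1] - x[2]        # butterfly differences
    return [s0 + s1, (d0 << 1) + d1, s0 - s1, d0 - (d1 << 1)]

def dct2d(block):
    rows = [dct1d(r) for r in block]                # 1-D transform on rows
    cols = [dct1d(list(c)) for c in zip(*rows)]     # then on columns
    return [list(r) for r in zip(*cols)]            # transpose back
```

The (0,0) output is the sum of all 16 input samples, which is a quick sanity check on any implementation of this transform.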
The function of the one-dimensional DCT can be seen in figure 3.9. [6]

Figure 3.9: DCT functional schematic

The operations performed while calculating the DCT as shown in figure 3.9 can be written as equation (3.13):

    X0 = (x0 + x3) + (x1 + x2)
    X2 = (x0 + x3) - (x1 + x2)
    X1 = 2(x0 - x3) + (x1 - x2)
    X3 = (x0 - x3) - 2(x1 - x2)    (3.13)

3.5.2 Inverse Discrete Cosine Transform

The transform that reverses the DCT is called the Inverse Discrete Cosine Transform (IDCT). With the design of the DCT in H.264 it is possible to ensure zero mismatch between different decoders, because the DCT and the IDCT (3.14) can be calculated in integer arithmetic. In the standard DCT some mismatch can occur, caused by different representation and precision of fractional numbers in the encoder and decoder. [10] The 2D IDCT transform in H.264 is given by

    Xr = Ci^T (Y (x) Ei) Ci    (3.14)

where Xr is the reconstructed original block, Y is the previously transformed block, Ei is the pre-scaling matrix

    Ei = [ a^2  ab   a^2  ab
           ab   b^2  ab   b^2
           a^2  ab   a^2  ab
           ab   b^2  ab   b^2 ]

and Ci^T has the rows (1, 1, 1, 1/2), (1, 1/2, -1, -1), (1, -1/2, -1, 1) and (1, -1, 1, -1/2). As with the DCT, the pre-scaling ((x) Ei) can be absorbed into the rescaling process [10]; this is described in more detail in section 3.5.4, which covers the rescaling.

Figure 3.10: IDCT functional schematic

The function of the IDCT can be seen in figure 3.10. To perform a two-dimensional IDCT, two one-dimensional IDCTs are performed after each other, the first one on rows and the second one on columns, or vice versa. [6] The operations performed while calculating the IDCT can be written as equation (3.15).
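As a sketch, the one-dimensional IDCT butterfly of figure 3.10, with its 1/2 factors on X1 and X3, looks as follows (illustrative Python; in hardware the halving is an arithmetic right shift):

```python
# 1-D IDCT butterfly of figure 3.10.

def idct1d(X):
    s, d = X[0] + X[2], X[0] - X[2]       # sum and difference of even inputs
    e, f = X[1] + X[3] / 2, X[1] / 2 - X[3]
    return [s + e, d + f, d - f, s - e]
```

A constant (DC-only) input reconstructs to a flat block of samples, which is an easy check of the butterfly wiring.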
    x0 = (X0 + X2) + (X1 + X3/2)
    x1 = (X0 - X2) + (X1/2 - X3)
    x2 = (X0 - X2) - (X1/2 - X3)
    x3 = (X0 + X2) - (X1 + X3/2)    (3.15)

3.5.3 Quantization

Information is often concentrated in the lower frequency area; therefore quantization can be used to further compress the data after applying the DCT. H.264 uses a parameter in the quantization called the Quantization Parameter (QP). The QP describes how much quantization should be applied, i.e. how much data should be truncated. A total of 52 values ranging from 0 to 51 are supported by the H.264 standard. Using a high QP will decrease the size of the coded data, but it will also decrease the visual quality of the coded video. With QP = 0 the quantization is minimal and nearly all data is kept. [10]

From QP the quantizer step size (Qstep) can be derived. The first values of Qstep are presented in table 3.1. Note that Qstep doubles in value for every increase of 6 in QP. The large number of step sizes provides the ability to accurately control the trade-off between bitrate and quality in the encoder. [10]

    QP:     0      1       2       3      4  5      6     7      8      ...
    Qstep:  0.625  0.6875  0.8125  0.875  1  1.125  1.25  1.375  1.625  ...

Table 3.1: Qstep for a few different values of QP

The basic formula for quantization can be written as

    Zij = round(Yij / Qstep)    (3.16)

where Yij is a coefficient of the previously transformed block to be quantized and Zij is a coefficient of the quantized block. The rounding operation does not have to be to the nearest integer; it can be biased towards smaller integers, which can give perceptually higher quality. This is true for all rounding operations in the quantization. [10]

As mentioned in section 3.5.1, the quantization can absorb the post-scaling ((x) Ef) from the DCT. The unscaled output from the DCT can then be written as W = Cf X Cf^T (as compared to the scaled output Y = Cf X Cf^T (x) Ef). [10] This gives
    Zij = round(Wij * PFij / Qstep)    (3.17)

where Wij is a coefficient of the unscaled transformed block, Zij is a coefficient of the quantized block and PFij is either a^2, ab/2 or b^2/4 for each (i, j) according to

    PF = [ a^2    ab/2   a^2    ab/2
           ab/2   b^2/4  ab/2   b^2/4
           a^2    ab/2   a^2    ab/2
           ab/2   b^2/4  ab/2   b^2/4 ]    (3.18)

where a and b are the same as in equation (3.10) in section 3.5.1. [10]

PF and Qstep can then be reformulated using a multiplication factor (MF) and a division. MF is in fact a 4 × 4 matrix of multiplication factors according to

    MF = [ A  C  A  C
           C  B  C  B
           A  C  A  C
           C  B  C  B ]    (3.19)

where the values of A, B and C depend on QP according to

    QP:  0      1      2      3     4     5
    A:   13107  11916  10082  9362  8192  7282
    B:   5243   4660   4194   3647  3355  2893
    C:   8066   7490   6554   5825  5243  4559

Table 3.2: Multiplication factor MF

The scaling factors in MF are repeated for every increase of 6 in QP. The reformulation of PF and Qstep then becomes

    MF / 2^qbits = PF / Qstep    (3.20)

where qbits is calculated as

    qbits = 15 + floor(QP / 6)    (3.21)

This gives a new quantization formula according to

    Zij = round(Wij * MFij / 2^qbits)    (3.22)

which is the final form. [10]

3.5.4 Rescaling

The rescaling also uses Qstep, which depends on the Quantization Parameter (QP) and is the same as for the quantization (see table 3.1). The basic formula for rescaling can be written as

    Y'ij = Zij * Qstep    (3.23)

where Zij is a coefficient of the previously quantized block and Y'ij is a coefficient of the rescaled block. The rounding operation, as in the quantizer, does not have to be to the nearest integer; it can be biased towards smaller integers, which can give perceptually higher quality. This is true for all rounding operations in the rescaling. [10]

As the quantization formula was reformulated, the rescaling formula can also absorb the pre-scaling ((x) Ei) and be reformulated to match the quantization formula.
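The practical effect of the reformulated quantization in equation (3.22), and of the matching rescaling, is that both reduce to an integer multiply followed by a shift. As an illustrative sketch for the (0,0) matrix position (not the ePUMA kernels; the real encoder applies the full 4x4 MF and V matrices):

```python
# Quantization (equation 3.22) and rescaling as multiply-and-shift for the
# DC position (0,0), where the MF and V entries are both the A values.

MF_A = [13107, 11916, 10082, 9362, 8192, 7282]  # A of table 3.2, QP mod 6 = 0..5
V_A = [10, 11, 13, 14, 16, 18]                  # corresponding rescale factors

def quantize_dc(w, qp):
    qbits = 15 + qp // 6
    return round(w * MF_A[qp % 6] / 2 ** qbits)

def rescale_dc(z, qp):
    return z * V_A[qp % 6] * 2 ** (qp // 6)
```

Because MF repeats every 6 QP steps while qbits grows by one, the effective step size doubles for every increase of 6 in QP, as table 3.1 states.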
The new formula for rescaling, where the pre-scaling factor is included, can be written as

    W'ij = Zij * Qstep * PFij * 64    (3.24)

where PFij is the same as in (3.18), Zij is a coefficient of the previously quantized block, W'ij is a coefficient of the rescaled block and the constant scaling factor of 64 is included to avoid rounding errors while calculating the inverse DCT. [10]

Much like MF for the quantization, the rescaling also uses a 4 × 4 matrix of scaling factors called V, which incorporates the constant scaling factor of 64 introduced in (3.24). V can be written as

    V = [ A  C  A  C
          C  B  C  B
          A  C  A  C
          C  B  C  B ]    (3.25)

where the values of A, B and C depend on QP according to

    QP:  0   1   2   3   4   5
    A:   10  11  13  14  16  18
    B:   16  18  20  23  25  29
    C:   13  14  16  18  20  23

Table 3.3: Scaling factor V

The scaling factors in V are, like those in MF, repeated for every increase of 6 in QP. With V the rescaling formula can be written as

    W'ij = Zij * Vij * 2^floor(QP/6)    (3.26)

which is the final form. [10]

3.6 Deblocking Filter

When using block coding algorithms such as the DCT, blocking artifacts can occur. This is unwanted because it lowers the visual quality and the prediction performance. The solution is to add a filter that removes these artifacts. The filter is placed after the IDCT in the encoding loop, which can be seen in figure 3.1. The filter is used on both luma and chroma samples of the video sequence. [10]

Figure 3.11: Filtering order of a 16x16 pixel macroblock, with start in A and end in H for luminance (a) and start in 1 and end in 4 for chrominance (b)

The deblocking filter in H.264 has 5 levels of filtering, 0 to 4, where 4 is the option with the strongest filtering. The filter actually consists of two different filters, where the first filter is applied on levels 1 to 3 and the second on level 4. Level 0 means that no filter should be applied. The filter level parameter is called boundary strength (bS).
The parameter depends on the current quantization parameter, the macroblock type and the gradient of the image samples across the boundary. There is one bS for every boundary between two 4x4 pixel blocks. The deblocking filter is applied to one macroblock at a time, in raster scan order throughout the frame. [5]

Figure 3.12: Pixels in blocks adjacent to vertical and horizontal boundaries

When the deblocking filter is applied on a macroblock it is done in a special order, which is illustrated in figure 3.11. The filter is applied on vertical and horizontal edges as shown in figure 3.12, where p0, p1, p2, p3, q0, q1, q2, q3 are pixels from the two neighboring blocks p and q. The filtering of these pixels only takes place if equations (3.27), (3.28) and (3.29) are fulfilled:

    |p0 - q0| < alpha(indexA)    (3.27)

    |p1 - p0| < beta(indexB)    (3.28)

    |q1 - q0| < beta(indexB)    (3.29)

    indexA = Min(Max(0, QP + OffsetA), 51)    (3.30)

    indexB = Min(Max(0, QP + OffsetB), 51)    (3.31)

The values of alpha and beta are approximately defined by equations (3.32) and (3.33):

    alpha(x) = 0.8 * (2^(x/6) - 1)    (3.32)

    beta(x) = 0.5x - 7    (3.33)

Note from equations (3.30) and (3.31) that the filtering depends on the Quantization Parameter. The filters applied are 3-, 4- and 5-tap FIR filters, which are further described in [5].

3.7 Entropy Coding

The H.264 standard supports two different entropy coding algorithms, Context-based Adaptive Variable Length Coding (CAVLC) and Context-based Adaptive Binary Arithmetic Coding (CABAC). CABAC is the more efficient of the two, but it requires higher computational complexity. The bitrate savings of CABAC can be between 9% and 14% compared to CAVLC [7]. CAVLC is supported in all H.264 profiles, but CABAC is only supported in the profiles above Extended. [10]

Chapter 4 Overview of the ePUMA Architecture

This chapter covers an introduction to the ePUMA processor architecture.
The memory hierarchy, the master core, the Sleipnir cores, the direct memory access controller and the simulator will be covered.

4.1 Introduction to ePUMA

Embedded Parallel Digital Signal Processing Processor Architecture with Unique Memory Access (ePUMA) is a multi-core DSP processor architecture with 1 master core and 8 calculation cores. The master core handles the Direct Memory Access (DMA) communication. The slave core, which is also called Sleipnir, is a 15-stage pipelined calculation core.

4.2 ePUMA Memory Hierarchy

The ePUMA memory hierarchy consists of three levels, where the first level is the off-chip main memory, the second level is the local storage of the master and slaves, and the third and final level is the registers of the master and slave cores. Figure 4.1 illustrates how each core is connected to the on-chip interconnection, which in turn is connected to the off-chip main memory. The main memory is addressed with both a high word of 16 bits and a low word of another 16 bits, which means that 32-bit addressing is used where each address corresponds to a word of data.

Figure 4.1: ePUMA memory hierarchy (off-chip main memory at level 1; the local stores of the master (PM, DM 0, DM 1) and of each Sleipnir core (PM, CM, LVM 1-3) at level 2; the register files at level 3)

The on-chip network is depicted in figure 4.2, where N0 to N7 are interconnection nodes. As can be seen from the figure, the nodes are connected both to the master and to their respective Sleipnir core, but also to the other nodes. This gives the ability to transfer data between Sleipnir cores and even to pipeline the cores. With this setup, data can be transferred in any way and combination that does not overlap.
Figure 4.2: ePUMA star network interconnection (the master, the DMA controller and the main memory connected to nodes N0 to N7, one node per Sleipnir core)

4.3 Master Core

The master core is for the moment based on a processor called Senior. This processor has been used at the Division of Computer Engineering for some years and is used in some courses for educational purposes. The Senior processor is a DSP processor, which means it has a Multiply and ACcumulate (MAC) unit and other DSP related capabilities. To make it able to serve as a master core, memory ports for the DMA controller and interrupts coming from the DMA and Sleipnir cores have been added.

4.3.1 Master Memory Architecture

The master core has 2 RAMs and 2 ROMs, organized as Data Memory 0 (DM 0) and Data Memory 1 (DM 1). These memories are the local storage of the master core. The ROMs start at address 0x8000 in their respective memory, which gives 0x7FFF = 32767 words in each RAM to work with. For calculations the master core has 32 16-bit registers that can be used as buffers. There are also a number of special registers, such as 4 address registers, registers for hardware looping and registers supporting cyclic addressing in address registers 0 and 1. Address registers 0 and 1 also support different step sizes.

4.3.2 Master Instruction Set

The programming guide and instruction set for Senior can be found in [9] and [8], even though they might not be totally accurate because of the modifications made for the ePUMA project. The master's instruction set is largely the same as the Senior instruction set. It is a standard DSP instruction set with support for a convolution instruction which multiplies and accumulates the results. To speed up looping, a hardware loop function called repeat is included. All jumps, calls and returns can use 0 to 3 delay slots.
The number of delay slots specifies how many instructions after the flow control instruction will be executed. If not all delay slots are used for useful instructions, nop instructions are inserted in the pipeline.

4.3.3 Datapath

The datapath of the master consists of a 5-stage pipeline, which can be seen in figure 4.3. There is only one exception to this: the convolution instruction (conv) uses a 7-stage pipeline, but a figure of this is omitted for lack of relevance. The datapath is advanced enough for scalar calculations; larger computational loads should be delegated to the Sleipnir cores. Table 4.1, originally found in [9], describes the pipeline stages.

Figure 4.3: Senior datapath for short instructions

    Pipe  RISC-E1/E2                  RISC memory load/store
    P1    IF: Instr. fetch            IF: Instr. fetch
    P2    ID: Instr. decode           ID: Instr. decode
    P3    OF: Operand fetch           OF+AG: Compute address
    P4    EX1: Execution (set flags)  MEM: Read/Write
    P5    EX2: Only for MAC, RWB      WB: Write back (if load)

Table 4.1: Pipeline specification

4.4 Sleipnir Core

Sleipnir is the name of the calculation core, and the ePUMA processor has 8 of them. Sleipnir is a Single Instruction Multiple Data (SIMD) architecture, which in this case means it can perform vector calculations. Each full vector consists of 128 bits and is divided into 8 words of 16 bits which can run through the pipeline in parallel. The datapath of the Sleipnir core has 15 pipeline stages, and the pipeline length of an instruction is variable depending on the choice of operands.

4.4.1 Sleipnir Memory Architecture

The Sleipnir core has 3 memories, where 2 of them are connected to the core and the third is connected to the DMA bus. The memories are called Local Vector Memories (LVMs).
By being able to swap which memories are connected to the processor and which memory is connected to the DMA, better utilization can be reached and a lot of the transfer cycle cost can be hidden.

Constant Memory

Each Sleipnir core is also provided with a Constant Memory (CM) for constants used during runtime, such as scalar constants or permutation vectors. All constants that will be used during runtime can be stored in the CM. The memory can contain up to 256 vectors.

Local Vector Memory

The Local Vector Memories (LVMs) are the local memories of the Sleipnir core. As described above, each core has access to 2 LVMs at runtime. These memories are 4096 vectors large, where each vector is 128 bits wide. The memories have one address for each word of 16 bits, and they consist of 8 memory banks, one for each word in a vector. The constant memory can be used to address the LVMs according to the values stored in the constant memory. This constant memory addressing of the LVMs can be used to generate a permutation of data, which can be used for e.g. transposing a matrix.

Vector Register File

There are 8 Vector Registers (VR) in the Vector Register File (VRF), VR0 to VR7, for use in computations during runtime. Each word can be accessed separately, and it is also possible to access a double word or the high or low half vector of each of the 8 vector registers. The different access types are listed in table 4.2, originally found in [4].

    Syntax    Size     Description
    vrX.Y     16-bit   Word
    vrX.Yd    32-bit   Double word
    vrX{h,l}  64-bit   Half vector
    vrX       128-bit  Vector

Table 4.2: Register file access types

Special Registers

There are 4 address registers, ar0-ar3, which can be used to address memory in the LVMs. There are also 4 configuration registers for these 4 address registers.
These configuration registers hold values for top, bottom and step size, which can be used when addressing memories in all kinds of loops. The different increment operations are listed in table 4.3, originally found in [4].

    arX+=C   Fixed increment; C = 1, 2, 4 or 8
    arX-=C   Fixed decrement; C = 1, 2, 4 or 8
    arX+=S   Increment from stepX register
    arX+=C%  Fixed increment with cyclic addressing
    arX-=C%  Fixed decrement with cyclic addressing
    arX+=%   Increment from stepX with cyclic addressing

Table 4.3: Address register increment operations

The addressing of the two LVMs can be done with one of the four address registers, immediate addresses, vector registers, or in combination with the constant memory to form advanced addressing schemes, as shown in table 4.4, originally found in [4].

    Mode#  Index  Offset  Pattern           Syntax example
    0      arX    0       0,1,2,3,4,5,6,7   [ar0]
    1      arX    0       cm[carX]          [ar0 + cm[car0]]
    2      arX    0       cm[imm8]          [ar0 + cm[10]]
    3      arX    0       cm[carX + imm8]   [ar0 + cm[car0 + 10]]
    4      0      vrX.Y   0,1,2,3,4,5,6,7   [vr0.0]
    5      0      vrX.Y   cm[carX]          [vr0.0 + cm[car0]]
    6      0      vrX.Y   cm[imm8]          [vr0.0 + cm[10]]
    7      0      vrX.Y   cm[carX + imm8]   [vr0.0 + cm[car0 + 10]]
    8      0      0       vrX               [vr0]
    9      0      0       cm[carX]          [cm[car0]]
    10     0      0       cm[imm8]          [cm[10]]
    11     0      0       cm[carX + imm8]   [cm[car0 + 10]]
    12     arX    0       vrX               [ar0 + vr0]
    13     arX    vrX.Y   0,1,2,3,4,5,6,7   [ar0 + vr0.0]
    14     arX    imm16   0,1,2,3,4,5,6,7   [ar0 + 1024]
    15     0      imm16   0,1,2,3,4,5,6,7   [1024]

Table 4.4: Addressing mode examples

Program Memory

The program memory (PM) can contain up to 512 instructions. It can be loaded from the main memory by issuing a DMA transaction. The program that is loaded into the Sleipnir PM is called a block, and a kernel is a combination of master code and blocks. A block can utilize several Sleipnir cores with internal data transfers. Blocks can however not communicate with cores outside the block, and can not be data dependent on any other block running at the same time.
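The constant-memory pattern addressing mentioned above (modes 1-3, 5-7 and 9-11 in table 4.4) can be modeled with a short sketch. This is illustrative Python, not Sleipnir assembly, and `read_permuted` is a hypothetical helper: reading the LVM through a CM index pattern yields permuted data, e.g. a transposed 4x4 tile, without any extra move instructions:

```python
# Model of CM-pattern addressing: the CM holds an index pattern and an LVM
# read returns lvm[base + pattern[k]] for each lane k.

def read_permuted(lvm, cm_pattern, base=0):
    return [lvm[base + p] for p in cm_pattern]

# Pattern that fetches a row-major 4x4 tile in transposed (column-major) order.
TRANSPOSE_4X4 = [4 * (k % 4) + k // 4 for k in range(16)]
```

With a row-major tile stored at `base`, the first eight reads return the first two columns of the tile, i.e. the first two rows of its transpose.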
If for some reason the Sleipnir block code is larger than 512 instructions, it can be divided into two programs and the data can be transferred between two Sleipnir cores. For this to work, code is needed in the master to keep track of the cores and move the data to the next core for further processing. When developing a new block or kernel it can sometimes be useful to have a little extra memory; therefore it is possible to increase the size of the PM in the simulator.

4.4.2 Datapath

The datapath of the Sleipnir slave core is an 8-way 16-bit datapath. The datapath is divided into 15 pipeline stages and is depicted in figure 4.4. A more detailed version of the datapath can be found in [2].

[Figure 4.4: Sleipnir datapath pipeline schematic — stages A1-A2 (instruction fetch and decode), B1-B4 (CM and LVM scalar/vector addressing), C1 (operand selection and formatting from the LVMs, VRF and SPRF), D1-D4 (multipliers, ALU 1 and ALU 2) and E1-E4 (write back).]

The datapath includes 16 16x16-bit multipliers and two Arithmetic Logic Units (ALU) connected in series. Simpler instructions can bypass the first ALU and thereby become shorter instructions, which saves some execution time. These bypasses can be seen in stages D1 to D4 in figure 4.4. Some instructions use a very short datapath, such as the jump instruction, which is executed in stage A2. This makes the use of precalculated branch decisions unnecessary. Stages E1 to E4 can be described as the write back stage and therefore follow after stage D4. Stages D3 and D4 are very similar but provide the core with the possibility of performing the summation of a complete vector and similar tasks.

4.4.3 Sleipnir Instruction Set

The instruction set used is application specific. It includes no move or load instructions for data; these functions are all included in one instruction, called copy.
Operands and instructions can be combined in different ways, with variable pipeline length as a result. The pipeline length depends on e.g. where the input operands are fetched from, where the result will be stored and whether the instruction uses or bypasses the first ALU and the multipliers. Instruction names are built upon what data they affect and how. For example, the instruction vcopy m0[0].vw m1[0].vw copies a vector from memory 1, address 0, to memory 0, address 0. If the instruction scopy were used instead, it would only copy a scalar word. Another example is the add instruction: if vaddw m0[0].vw m1[0].vw vr0 is used, one vector is loaded from m1 and one from vr0. The .vw after the memory address denotes that the vectors will be added word wise, that is, they will be treated as eight 16-bit words. This means that the processor can carry out 8 additions per clock cycle. [4]

4.4.4 Complex Instructions

To reach better performance the datapath has to be utilized as much as possible, especially in the inner loops of the critical path. To achieve this, new specialized instructions that perform several smaller tasks could be implemented. By pipelining several of these new complex instructions, more work can be done in less time and the program reaches an increased throughput. The aspects that have been considered when deciding whether to accelerate certain parts of the code are listed below.
• Motivation – why the acceleration should be done
• Description – what is going to be accelerated
• Extra hardware needed – what extra hardware is needed to accelerate the specific task
• Profiling and usage – whether the task is used often enough to be worth accelerating
• Extra hardware cost – what the extra hardware costs
• Cycle gain – how many cycles can be saved
• Efficiency – how efficient the new solution is in terms of cost per gain in performance

4.5 DMA Controller

The Direct Memory Access (DMA) controller is used to load and store data to and from an off-chip memory. The DMA can transfer a 128-bit vector to one of the Sleipnirs every cycle. It can also broadcast data to one or more Sleipnirs. If a block is to be loaded into two or more cores it can be broadcast, so that the process does not lose cycles by loading the block separately to each core. This saves time both because of the time it takes to copy the data and because it takes some cycles to configure and start a DMA transaction.

[Figure 4.5: Sleipnir Local Store switch — the DMA, PM, CM and the three LVMs connected to the Sleipnir core through two switches and the NoC.]

As mentioned before, there are 3 local vector memories belonging to each core. There are 6 different setups for how the memories are connected to the DMA and the core. The switch which controls this is illustrated in figure 4.5, originally described in [2]. There is also a switch for selecting whether an LVM, the PM or the CM should be connected to the DMA. This switch is changed accordingly when programming the Sleipnir core's PM, CM or LVM. To initiate a DMA transaction the DMA unit needs to be configured. This configuration includes start addresses in both memories, the number of vectors to be transferred, how to access the data, step size in memory, switch configuration and broadcast configuration. The DMA has support for 2D accesses in main memory, which can be helpful when advanced access patterns are used.
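The 2D main-memory access mode can be sketched roughly as follows; the function and parameter names are illustrative, not the real DMA configuration interface:

```python
def dma_2d_read(memory, start, row_vectors, rows, line_stride):
    """Gather `rows` lines of `row_vectors` vectors each from a
    row-major main memory, stepping `line_stride` vectors between
    line starts -- a sketch of the 2D access mode described above
    (the parameter names are assumptions, not the real DMA setup)."""
    out = []
    for r in range(rows):
        base = start + r * line_stride
        out.extend(memory[base:base + row_vectors])
    return out

# Fetch a 16-row, 2-vector-wide macroblock from a frame that is
# 120 vectors wide (one "vector" per list element for simplicity).
frame = list(range(120 * 16))
mb = dma_2d_read(frame, start=5, row_vectors=2, rows=16, line_stride=120)
```

A rectangular macroblock can thus be fetched as one DMA task instead of one transfer per line, which matters given the per-transaction configuration cost mentioned above.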
When the configuration of the DMA unit is done, the task can be started.

4.6 Simulator

The ePUMA architecture has a full system simulator available. The simulator is bit and pipeline true. Simulations can be done either on one standalone Sleipnir core or on the full system simulator, where the master core and all 8 Sleipnir cores are included. The simulator can be invoked from a Python script. It provides a number of functions that can be used to access LVM contents, address registers, vector registers, the program counter and the instruction being executed [3]. This can be used both for debugging and for profiling. Interrupts from the DMA and the Sleipnir cores can also be caught by the same Python script. Pre-processing of the input and post-processing of the results directly in the Python script are also possible. The simulator supports different modes of simulation: either simulation until an event occurs or simulation one cycle at a time. When simulating until an event happens, these events need to be enabled in the simulator. Events that can be enabled are the starting and stopping of a specific Sleipnir core, memory accesses out of range and data hazards such as read before write. Simulating one cycle at a time offers more opportunities to evaluate each step of an execution. To carry out a full system simulation the simulator needs input such as master code, Sleipnir code and the data that is going to be used during runtime. These can be added before the simulation begins. Allocating memory for results in main memory is also possible.

Chapter 5

Elaboration of Objectives

This chapter gives a more detailed task specification using knowledge acquired from the previous theoretical chapters. It also describes the method used and the procedure taken.
5.1 Task Specification

The main task at hand is to evaluate a new processor architecture and how capable it is of H.264 video encoding, using the available system simulator. The evaluation will be done by developing selected parts of an H.264 encoder. Most weight should be put on evaluating the more computationally intensive parts, which are likely to constitute the bottleneck of the encoding. To implement an encoder, or parts of it, that follows the H.264 standard, a thorough understanding of both the H.264 standard's core parts and the ePUMA processor architecture, tool chain and instruction set is needed. This information and understanding has to be acquired first. The video focused upon will be 1080p full high-definition (HD) video at a rate of 30 frames per second (FPS) using the 4:2:0 sampling format, which was described in section 2.2. The video frames that calculations will be performed on are presumed to be stored with 8 bits per pixel in the main memory. Once performance results have been acquired, possible areas of improvement can be exposed. Different ways of improvement can then be compared, both in terms of performance improvement and estimated extra hardware needed, to give a measurement of efficiency. The results will also be compared to the results from the H.264 encoder for the STI Cell architecture presented in [15]. Other parts that will be evaluated include the Discrete Cosine Transform (DCT), Inverse Discrete Cosine Transform (IDCT), Quantization and Rescaling.

5.1.1 Questions at Issue

The following questions were derived from the purpose in section 1.2 and the task specification.

• Is it possible to perform real-time full HD video encoding at 30 FPS using the H.264 standard on the ePUMA processor?
• Would it be possible to modify the processor architecture to reach better performance and, if so, would it be worth the cost of the potentially added hardware?
• What are the cycle costs compared to the STI Cell H.264 encoder?

5.2 Method

The main method used to conduct the work has been to use the ePUMA system simulator. The simulator was invoked from a script written in the Python programming language. This gave enough flexibility to enable measurement of all the results, as well as the ability to automate testing within that same script. If the sole purpose of this thesis had been to give performance measurements, other methods might have been candidates, such as hand calculation of cycle costs. As the purpose of this thesis also includes functional implementations, using the simulator is the choice that offers the best validity if used correctly. If the implementation of the simulator is correct according to the proposed architecture, it will give measurements of high reliability. This is based on the fact that the simulator is pipeline true as well as cycle- and bit-correct.

5.3 Procedure

The procedure taken while working on this thesis started with a study of video coding, the H.264 standard and the ePUMA processor architecture. Once the required information had been acquired, a functional stand-alone Sleipnir block for motion estimation was developed. When the block was found to be correct and working, code for the master was developed, so that the master could run the motion estimation using the Sleipnir block. From this point the motion estimation kernel was developed through various stepwise improvements. The master code was also extended to be able to run the different versions of the kernel using a variable number of slave cores. Then the construction of the other Sleipnir blocks, such as DCT, IDCT, Quantization and Rescaling, began. Once all blocks were implemented, performance measurements could be acquired and the results could be analyzed to draw conclusions and answer the questions at issue.
Chapter 6

Implementation

This chapter covers how the implementation of the different kernels and blocks was done, how they evolved, and the different decisions that were made and why.

6.1 Motion Estimation

Motion estimation was found to be the prime target for performance evaluation as it, in nearly all cases, takes up the majority of the encoding cycle time. All implementations of motion estimation are done for a frame 65 macroblocks high and 118 macroblocks wide. The reason for this is that it simplifies the implementation, as the corners and sides of a frame constitute special cases. The number of macroblocks left out by this simplification is 430, compared to the total of 120 ∗ 67.5 = 8100 macroblocks for a full HD frame. This corresponds to 5.31% and still leaves 7670 macroblocks to perform calculations on. The search area was chosen as (−15, 15) × (−15, 15), according to what was used in [15], to yield as comparable results as possible. Another simplification of the motion estimation is that it is only performed on entire macroblocks; no further division into e.g. 16 × 8, 8 × 8 or 4 × 4 pixels is performed. The reason for this is that it might not be feasible to perform these calculations on a low-power architecture such as ePUMA without increasing the clock frequency. Doing so would be counterproductive from a low-power point of view and, even if it were doable, might not be applicable to hand-held devices running on batteries.

6.1.1 Motion Estimation Reference

In order to evaluate the results produced by the motion estimation kernels, a reference motion estimation program was written in the Python scripting language. By comparing the resulting motion vectors and costs in an automated fashion, the functionality of the kernels could be verified with little effort.

6.1.2 Complex Instructions

The function of the innermost loop of the motion estimation can be described as follows:

1. 16 ∗ 16 = 256 subtractions of 8-bit unsigned numbers.
2.
Calculate the absolute value of each subtraction result.
3. Sum all absolute values together into one final sum.

This gives a total of 256 subtraction operations (SUB), 256 absolute value operations (ABS) and 255 addition operations (ADD), which is equal to 767 operations. This theoretically corresponds to 32 vector word SUB instructions, 32 vector word ABS instructions and 37 vector word SUM instructions in the Sleipnir core. Those instructions would need a total of 32 + 32 + 37 = 101 cycles in the Sleipnir core. By examining the Sleipnir datapath, as seen in figure 4.4, it can be found that several of the necessary operations could be done in series in a pipelined fashion. By exploiting this, a new complex instruction could be constructed, as mentioned in section 4.4.4. In addition, by having the operand selection and operand formatting parts of the pipeline fetch 8-bit unsigned numbers from the operands and feed them to the datapath as 16-bit unsigned numbers, a further reduction of cycle time could be achieved. By utilizing the datapath to this extent, the result of the complex instruction is two partially summed scalar words. This means another 9 vector word SUM instructions will still be needed, which gives a total theoretical computation time of 32 + 9 = 41 cycles for calculating one macroblock sum of absolute differences (MB SAD). By studying the hexagon search algorithm (section 3.4.1) it can be seen that the algorithm needs to calculate the sum of absolute differences between two macroblocks (MB SAD) a number of times equal to 7 + 3 ∗ n + 4, where n is the number of steps taken. It is also known that there will be 8100 macroblocks to perform a hexagon search upon in each frame. A summary of the considerations taken, as mentioned in section 4.4.4, is listed below.

• Motivation – innermost loop of motion estimation.
• Description – perform absolute difference and partial sum.
• Extra hardware needed – none, or possibly 8-bit operand selection.
• Profiling and usage – used (7 + 3 ∗ n + 4) ∗ 8100 times per frame.
• Cycle gain – theoretically 101 − 41 = 60 cycles for each MB SAD.
• Extra hardware cost – none, or affordable.
• Efficiency, gain per cost – very high.

Analyzing the innermost loop of the motion estimation thus leads to the conclusion that using complex instructions could give a real boost to performance. The new instructions specific to motion estimation were named HVBSUMABSDWA and HVBSUMABSDNA, which can be read out as "Half Vector Bytewise SUM of ABSolute Differences Word Aligned" and "Half Vector Bytewise SUM of ABSolute Differences Not word Aligned" respectively. The proposed hardware setups of the datapath for these two instructions are depicted in Appendix A.1 and A.2. As can be seen from the figures, the instructions do not use the full width of the operands, because the data is stored bytewise in memory. The datapath will still be fully utilized, as the 8-bit input pixel values are promoted to 16-bit values in the 8 computation lanes of the Sleipnir datapath. In addition to the motion estimation instructions, two more instructions can follow without any essential additional cost. These instructions were named HVBSUBWA and HVBSUBNA, which can be read out as "Half Vector Bytewise SUBtraction Word Aligned" and "Half Vector Bytewise SUBtraction Not word Aligned" respectively. These instructions are used in motion compensation, as the subtraction results have to be kept intact to produce the residue frame. The implementation of these instructions can be seen in Appendix A.3 and A.4.

6.1.3 Sleipnir Blocks

The hexagon search Sleipnir block was the first part to be implemented, at first only focusing on performing calculations on one macroblock at a time.
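The operation counts above, and the behavior of the HVBSUMABSD-type instructions, can be sketched in plain Python; the single-accumulator form below is a simplification of the two partial scalar sums described in the text:

```python
import random

def mb_sad(cur, ref):
    """Reference macroblock SAD: 256 subtractions, 256 absolute
    values and 255 additions, exactly as counted in the text."""
    return sum(abs(c - r) for c, r in zip(cur, ref))

def hvbsumabsd(cur8, ref8, acc):
    """Model of one HVBSUMABSD-type issue: take 8 unsigned bytes per
    operand (half a 128-bit vector, bytewise), widen them to 16 bits
    and accumulate the sum of absolute differences.  Collapsing the
    two partial scalar results into one accumulator is a
    simplification for readability."""
    return acc + sum(abs(c - r) for c, r in zip(cur8, ref8))

random.seed(1)
cur = [random.randrange(256) for _ in range(256)]
ref = [random.randrange(256) for _ in range(256)]

# 256 pixels / 8 lanes = 32 issues of the complex instruction,
# matching the 32-cycle figure before the final summation step.
acc = 0
for i in range(0, 256, 8):
    acc = hvbsumabsd(cur[i:i+8], ref[i:i+8], acc)
```

The loop body is what the complex instruction fuses into a single pipelined issue, replacing a separate SUB, ABS and partial SUM.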
The input needed to perform the calculations is one macroblock from the new frame and a larger chunk of data from the previous frame, the reference, that makes up the search area. All motion estimation blocks are divided into smaller functions, and the program flowchart is depicted in figure 6.1. When execution starts, the program calculates the Sum of Absolute Differences (SAD) for the first 7 search points MID, LEFT, RIGHT, UP LEFT, UP RIGHT, DOWN LEFT and DOWN RIGHT, as shown in figure 6.1. Once the first 7 SAD costs have been calculated, the program reaches the main loop, where the MIN function determines which cost is lowest and moves on to one of the 7 corresponding MIN functions. The 6 directional MIN functions (MIN LEFT, MIN RIGHT, MIN UP LEFT, ...) update the motion vectors and data addresses and then move on to the corresponding 3 new search points. Once the SAD costs of these 3 new search points have been calculated, the MIN function is again used to find the new minimum cost and the loop continues. The MIN MID state is reached if the middle point was found to be the search point with the lowest SAD cost. When this happens, Phase 2 (P2) of the algorithm starts, which means that the search pattern is changed to the small hexagon (figure 3.7). Once the final 4 search points have been calculated, the smallest cost amongst them is found by the P2 MIN function and the final motion vector is calculated. For the Sleipnir blocks that do not use the Motion Compensation (MC) function, the DONE/RESTART state is then reached; if MC is used, it is calculated before the block finishes. The blocks calculating on one macroblock reach the DONE stage and finalize their execution. The RESTART function is naturally only used by the blocks calculating on more than one macroblock per execution. If the DONE/RESTART function is reached, the program starts over from START/RESTART if there are more macroblocks left to compute; otherwise it reaches DONE and finalizes its execution.
The "P2" in some function names indicates that they are used in phase two of the search algorithm, when the small hexagon pattern discussed in section 3.4.1 is used.

[Figure 6.1: Motion estimation program flowchart — the START/RESTART state, the 7 SAD calculating states (MID, LEFT, RIGHT, UP LEFT, UP RIGHT, DOWN LEFT, DOWN RIGHT), the MIN states, the P2 states, P2_MIN, MC and DONE/RESTART.]

The MC part in the final stage of figure 6.1 is, as mentioned, only included in the final block, but all other stages are common to all blocks. The computations performed by the different functions in figure 6.1 are depicted in figure 6.2. The Finite State Machine (FSM) included in figure 6.2 is the state machine presented in figure 6.1. Here "SAD calculating functions" are e.g. LEFT, RIGHT and UP LEFT, and "Min functions" are e.g. MIN, P2 MIN, MIN LEFT and MIN UP RIGHT.

[Figure 6.2: Motion estimation computational flowchart — the FSM dispatching to the SAD calculating functions (out-of-bounds check, odd/even flag check, data address calculation, ODD/EVEN computation, sum and store), the Min functions (find minimum value, update odd/even flag, update base address) and the MC calculating functions (EVEN MC/ODD MC, store, done or next MB).]

When the block starts, it first sets the loop counter to zero, sets the motion vector to (15, 15) (to start in the middle) and initializes the addresses used to access the reference macroblocks stored in the Local Vector Memory (LVM). Then the current data macroblock is copied from the input LVM to the other LVM, to make the calculations easier by accessing one memory for the reference and the other for the data.
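The two search patterns and the 7 + 3 ∗ n + 4 SAD count can be sketched as follows; the hexagon offsets are the textbook pattern from section 3.4.1 and are an assumption here:

```python
# Large-pattern offsets of the standard hexagon search (an assumption
# based on section 3.4.1): the center plus six hexagon vertices.
LARGE = [(0, 0), (-2, 0), (2, 0), (-1, -2), (1, -2), (-1, 2), (1, 2)]
# Small pattern used in Phase 2 around the final center point.
SMALL = [(0, -1), (0, 1), (-1, 0), (1, 0)]

def hexagon_sad_count(steps):
    """SAD evaluations for a search that moves the large hexagon
    `steps` times before refining with the small pattern:
    7 initial points, 3 new points per move, 4 final points."""
    return len(LARGE) + 3 * steps + len(SMALL)
```

Each move of the large hexagon exposes only 3 unevaluated points, which is why the per-step cost is 3 rather than 7.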
By comparing the motion vector corresponding to the current search point's position with the minimum and maximum values allowed, 0 and 30 respectively, search points that are out of bounds can be detected. If a search point is found to be out of bounds, the SAD calculation for that macroblock does not take place and the program continues to the next search point. In figure 6.2, ODD and EVEN are the computational functions which use the new instructions HVBSUMABSDWA and HVBSUMABSDNA to calculate the Sum of Absolute Differences (SAD) of the macroblocks. After that, the results are summed up to a single integer value and stored in memory on one of the 11 (7 + 4) addresses dedicated to the search points of the large and small hexagon search patterns. The MIN and P2 MIN functions can then find the smallest of the costs previously stored on the relevant part of these addresses. MIN, for instance, examines the first 7 costs from the large hexagon pattern, and P2 MIN the final 4 search points plus the middle point again. Once the minimum value is found, it is known whether the ODD/EVEN flag has to be updated or not: if the search point moves an even number of pixels the flag is unchanged, and if it moves an odd number of pixels the flag is inverted to indicate the change. The MC calculation is very similar to the SAD calculation; the differences are that it cannot be out of bounds and that the result is one complete macroblock of the residue, which consists of 32 vectors of 16-bit integers. Once the MC calculation is finished, the execution either finishes or restarts calculations on the next macroblock.

Simple Flow Control

The simple program flow controller was implemented with a series of conditional jump instructions and a status flag stored in memory, to move between the functions in the correct order as depicted in figure 6.3.
[Figure 6.3: Hexagon search program flow controller — a chain of conditional forward jumps selecting, according to the status flag value (1-19), between the SAD states, the MIN states, the P2 states and DONE.]

The status flag is updated by each function to enable execution of the next function in order. To make it possible to recalculate only the three necessary positions, additional flags are set by the corresponding MIN functions for each position, indicating whether it should be recalculated in the next pass or not. The block was verified to work as intended by comparing the produced results with the results produced by the Python motion estimation reference program described in section 6.1.1.

Advanced Flow Control

When the block using the simple program flow control was functioning, it became clear that implementing functionality for call and return in the slaves could yield an increase in performance. The result was the implementation of a relatively simple hardware stack with only 4 levels, as shown in figure 6.4. In figure 6.4 the original program flow control consists of the blocks inside the dashed box. The additional hardware added by call and return is the Call/Return Controller block and the 4 return address registers it uses. The added hardware cost of these parts is very reasonable; the increase in program flow controllability, and the performance gain that follows, makes this a well worthwhile addition.

[Figure 6.4: Proposed implementation of call and return hardware — the PC, PM, pipeline register, PC FSM and instruction decoder, extended with a Call/Return Controller and 4 return address registers.]

Once functionality for call and return instructions had been added to the simulator, a new hexagon search block utilizing these new features was written.
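The proposed 4-level return stack can be modeled in a few lines; the return-address convention (the address after the call) and the overflow behavior are assumptions, since the text does not specify them:

```python
class CallReturnController:
    """Model of the proposed 4-level hardware return stack: `call`
    pushes a return address and jumps, `ret` pops.  What happens on
    overflow is unspecified in the text, so this sketch raises."""
    DEPTH = 4

    def __init__(self):
        self.stack = []
        self.pc = 0

    def call(self, target):
        if len(self.stack) == self.DEPTH:
            raise RuntimeError("return-address stack overflow")
        self.stack.append(self.pc + 1)   # assumed: address after the call
        self.pc = target

    def ret(self):
        self.pc = self.stack.pop()

ctl = CallReturnController()
ctl.pc = 10
ctl.call(100)      # jump to a subroutine at PM address 100
ctl.call(200)      # one nested call -- two of four levels used
ctl.ret()          # back to address 101
```

With this mechanism the chain of conditional jumps in figure 6.3 collapses into ordinary subroutine calls, at the cost of four address registers and a small controller.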
With the call and return functionality, the somewhat primitive program flow controller could be replaced with function calls. The program flowchart in figure 6.1 is still valid for blocks using the advanced program flow control, because the functionality of the blocks has not changed.

Multiple Macroblocks

Once the call and return version of the block was completed and confirmed to be working, further development focused on making each Sleipnir core perform calculations on several macroblocks. The numbers of macroblocks to calculate motion vectors for during one Sleipnir block execution were chosen as 5 and 13, because they both divide 65 evenly. Two block versions, one for 5 and one for 13 macroblocks, were implemented and tested. One of the most substantial benefits of calculating multiple macroblocks at a time is the opportunity to exploit data reuse.

[Figure 6.5: Reference macroblock overlap — the shaded areas show the vertical overlap between the search areas of adjacent data macroblocks.]

For each extra macroblock beyond the first, an amount of data transfer equal to 6 macroblocks (6 ∗ 16 = 96 vectors) can be saved. The reason is that a 3 × Height search area avoids the vertical overlap, the shaded areas depicted in figure 6.5. The Sleipnir block calculating 13 motion vectors during each execution needs a data input equal to the 13 data macroblocks plus their search area, which makes up a 3 × 15 area of macroblocks. There is still a considerable horizontal overlap in this setup, but the advantage over calculating one macroblock per execution, transferring each data macroblock and its 3 × 3 macroblocks of search area, is considerable.
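The transfer figures above can be checked with a few lines; the 16-vectors-per-macroblock size and the 3 × (n + 2) search-area shape follow directly from the text:

```python
MB_VECTORS = 16   # one 16x16-pixel macroblock = 16 vectors of 16 bytes

def transfer_vectors(n_data_mbs):
    """Vectors transferred per block execution: n data macroblocks
    plus a search area 3 macroblocks wide and n + 2 high."""
    ref_mbs = 3 * (n_data_mbs + 2)
    return (n_data_mbs + ref_mbs) * MB_VECTORS

# Two single-MB executions versus one two-MB execution: the extra
# data macroblock only adds one 3x1 column of reference instead of
# a fresh 3x3 area, saving 6 macroblocks (96 vectors).
saved = transfer_vectors(1) * 2 - transfer_vectors(2)
```

The same formula reproduces the 3 × 15 = 45 reference macroblocks quoted for the 13-macroblock block.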
[Figure 6.6: Reference macroblock partitioning for 13 data macroblocks — the upper right corner of the reference frame in main memory, showing the reference columns fetched by Sleipnir 0 and Sleipnir 1 and the data overlap between them.]

Figure 6.6 shows the data partitioning of a frame in the main memory, where only the upper right corner of the full frame is shown. The numbered areas illustrate the overlap of the data macroblocks being calculated by each Sleipnir block execution. The data macroblocks are taken from the current frame, not the reference frame shown in figure 6.6. As the frame contains 118 columns to be calculated, the next row of 13 macroblocks starts as number 119.

Motion Compensation

Once a motion estimation block has found the best match, only a little extra time is needed to calculate the motion compensated residue of that macroblock. By adding this functionality to the motion estimation block, motion compensation can be achieved at a very low overhead cost, as all the information needed is already present. To perform motion compensation the block, once it has found its best match, has to perform a subtraction between two macroblocks. This adds up to 256 subtraction operations, or 32 vector subtraction instructions in the Sleipnir. The result is stored as 32 vectors of 8 16-bit integers and copied back to the main memory along with the motion vectors. As mentioned in section 6.1.3, the motion compensation block uses the HVBSUBWA and HVBSUBNA instructions to speed up the calculation of the residue macroblocks.

6.1.4 Master Code

The implementation of the master code was started once the Sleipnir code was found to be working.
The master's tasks include keeping track of how many more macroblocks to perform calculations on, setting up all DMA data transfers to and from the Sleipnir cores and dividing the motion estimation workload between them. The program flow of the master is shown in figure 6.7. In the prolog, the stack pointer, the DMA and slave interrupts, the number of macroblocks to compute, the address registers and the configuration of data storage in the main memory are set up. In the prolog the master also loads the program and constants into the Sleipnir cores' memories.

[Figure 6.7: Master program flowchart — the Prolog, then a loop of Configure DMA, Start DMA, Start Sleipnir, Find Next Available Sleipnir and Copy Results while more macroblocks remain to compute, followed by the Epilog.]

In the Configure DMA stage, the coming DMA transfers for data and reference are configured. The addresses to which the results should be written are saved to DM0 at the label called Results, which can be found in figure 6.8a. In the Start DMA stage the DMA transfers are started and the address of the location of the next data block is calculated. Two DMA transfers are completed here, and therefore a wait for the first one to finish is performed; the calculation of the next address is hidden during this wait. These addresses are saved to DM0 in the RAM blocks DMA data and DMA ref, for easier configuration of the DMA the next time the Configure DMA step is reached. In the Start Sleipnir stage the memory switches are set correctly before the Sleipnir core is started. After the Sleipnir core has been started, it is time to find a new available Sleipnir core to fill with data and start. The Find Next Available Sleipnir stage iterates over the Sleipnirs until it finds one that is free. This iteration makes it possible to rather quickly find any Sleipnir that has finished its execution.
Sleipnir cores are chosen in a first free, first served fashion, so that Sleipnir 0 has the highest priority and Sleipnir 7 the lowest. The program uses a status flag for each core to know what it is currently doing; these flags are used when finding a new free core. When a running Sleipnir block finishes, it sends an interrupt to the master, which changes the status flag of that core in an interrupt routine. The next time the master is looking for a free Sleipnir, it will find, due to the flag value, that this Sleipnir has finished and needs to have its results copied back. When a Sleipnir core has finished execution, the results from that core need to be copied back to the main memory. This is done in the Copy Results stage. The information on where the results should be copied is fetched from DM0 and written to the Copy Back (CB) allocated memory DMA CB, seen in figure 6.8a. The DMA unit is then configured with the information found in DMA CB. When the DMA transfer is finished, it is time to load the Sleipnir with new data to perform calculations on; this happens, of course, in Configure DMA and Start DMA. This type of program setup enables out of order execution of the Sleipnir cores, which is suitable for a search algorithm such as motion estimation that has a highly variable execution time. If there are more macroblocks available for calculation, the loop continues to the Configure DMA stage again. When all macroblocks are finished, the Epilog is activated. In the Epilog the master waits for all Sleipnirs to finish their executions. The last results are then copied to main memory and the kernel is finalized.
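The first free, first served selection can be sketched as follows; the status encoding is illustrative, not the master's actual flag values:

```python
# Illustrative status values; the real master keeps per-core flags
# that are updated from the Sleipnir/DMA interrupt routines.
FREE, RUNNING, DONE = 0, 1, 2

def find_next_sleipnir(status):
    """First-free-first-served scan: Sleipnir 0 has the highest
    priority and Sleipnir 7 the lowest.  A core flagged DONE must
    have its results copied back before it is reloaded, but it is
    still the next core to be serviced."""
    for core, s in enumerate(status):
        if s in (FREE, DONE):
            return core
    return None     # all cores busy -- the master keeps polling

status = [RUNNING, RUNNING, DONE, FREE, RUNNING, RUNNING, RUNNING, RUNNING]
core = find_next_sleipnir(status)
```

Because cores are serviced as soon as they finish, tasks complete out of order, which suits the highly variable run time of the hexagon search.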
[Figure 6.8: Memory allocation of the data memory in the master (a) and main memory allocation (b) — DM0 holds the DMA data, DMA CB, DMA ref and Results blocks in RAM 0; the main memory holds the data (current frame), the reference frame, the results, the Sleipnir PM contents and the CM contents.]

The memory allocation for the master's data memory can be found in figure 6.8a. It is a simple setup where three blocks of DMA configuration settings are stored at the top of RAM 0. During runtime the master points the DMA firmware to the different memory blocks where it reads the DMA settings. The last block in RAM 0 is used to store result pointers to main memory. These result pointers are needed because the blocks are likely to finish in an out of order fashion. The allocation overview of the main memory can be seen in figure 6.8b. Data and Ref are the data for the frames to encode. The Results block is memory allocated for the residues and the motion vectors. The last two blocks are memory allocated for the Sleipnir block program and its constants.

[Figure 6.9: Sleipnir core motion estimation task partitioning and synchronization — a timeline of tasks distributed over Sleipnir cores 0-4 and the resulting out of order completion of the results in main memory.]

Figure 6.9 shows the motion estimation task partitioning and synchronization between the Sleipnir cores. Task 0, Task 1 and so on comprise the copying of both the data macroblocks and the reference macroblocks used for the search area, and performing motion estimation on them. The number of data macroblocks in a task can be 1, 5 or 13, and the number of reference macroblocks 9, 21 or 45 respectively. The figure also shows examples of the Sleipnir cores' execution times and the resulting out of order completion of the tasks.
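The task counts implied by this partitioning can be checked, assuming the simplified 65 × 118 macroblock frame from section 6.1:

```python
FRAME_ROWS, FRAME_COLS = 65, 118   # simplified frame from section 6.1

def tasks_per_frame(mbs_per_task):
    """Motion estimation tasks handed out per frame when each
    Sleipnir execution covers a vertical run of data macroblocks.
    5 and 13 were chosen because both divide 65 evenly."""
    total_mbs = FRAME_ROWS * FRAME_COLS
    assert FRAME_ROWS % mbs_per_task == 0
    return total_mbs // mbs_per_task
```

Larger tasks mean fewer DMA configurations and master round-trips per frame, at the cost of a larger per-task working set in the LVMs.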
6.2 Discrete Cosine Transform and Quantization

The output from the motion compensation block is a motion compensated residue, which is the input to the next part of the encoder, the Discrete Cosine Transform (DCT) and Quantization block.

6.2.1 Forward DCT and Quantization

The DCT and the quantization were combined into one Sleipnir block to save cycles by performing the quantization directly after the transform. The Quantization Parameter (QP) was chosen as a fixed value of 10; this value is easily changed if another fixed value is desired. Adding support for a variable QP would cost both additional instructions and additional constants in the constant memory. To get as low execution times as possible while still following the H.264 standard, a variable QP was left out. The order of computations in the DCT and quantization block is as follows:

1. Process the blocks through the first DCT stage.
2. Transpose the blocks.
3. Process the blocks through the second DCT stage.
4. Transpose the blocks again.
5. Multiply by MF, scale by qbits and round to get the result.

The calculation of a 4×4 block based two-dimensional DCT as discussed in section 3.5.1 can be described as follows:

1. Calculate X0 .. X3 according to figure 3.9 for each row of the block.
2. Transpose the resulting 4×4 block to be able to calculate the DCT of the columns.
3. Calculate X0 .. X3 according to figure 3.9 for each column of the block.
4. Transpose the resulting 4×4 block again to get the final result.

As the block is transposed twice, the resulting block is not transposed compared to the input block. The input data is presumed to be stored as 16-bit integers, as this is the native Sleipnir datapath width. The input data itself consists of a number of 4×4 blocks of residue pixel values. To utilize the full datapath width of the Sleipnir, two 4×4 blocks can be calculated simultaneously.
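The two-pass structure above can be modelled in a few lines of Python. The butterfly below is the standard H.264 forward core transform (W = Cf · X · Cf^T), standing in for the figure 3.9 structure; this is an illustrative model rather than the Sleipnir code.

```python
# Illustrative model of the two-pass 4x4 forward transform: row pass,
# transpose, row pass, transpose (so the output is not transposed).

def dct1d(x):
    s0, s1 = x[0] + x[3], x[1] + x[2]   # butterfly additions
    s2, s3 = x[1] - x[2], x[0] - x[3]   # butterfly subtractions
    return [s0 + s1, 2 * s3 + s2, s0 - s1, s3 - 2 * s2]

def transpose(block):
    return [list(row) for row in zip(*block)]

def dct4x4(block):
    stage1 = transpose([dct1d(row) for row in block])   # first DCT + transpose
    return transpose([dct1d(row) for row in stage1])    # second DCT + transpose
```

For a flat block all the energy ends up in the DC position: a block of all ones transforms to W with W[0][0] = 16 and zeros elsewhere.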
The flow of the two-dimensional DCT is depicted in figure 6.10. First the input data, consisting of two 4×4 pixel blocks, is read in and transformed through the first DCT stage. The result will be two one-dimensionally DCT-transformed 4×4 pixel blocks. The following transpose of the blocks will be performed as shown in figure 6.11. After that the transposed blocks will be processed by the second DCT stage and finally the blocks are yet again transposed to complete the two-dimensional DCT transform.

[Figure 6.10: DCT flowchart. Input data of two 4×4 blocks of 16-bit integers passes through the first DCT stage, a first blockwise transpose, the second DCT stage and a second blockwise transpose to produce the final two-dimensional DCT output.]

[Figure 6.11: Memory transpose schematic. Two 4×4 blocks to be transposed are stored in memory (Memory Mapping) and read out through permutation addressing, with the address vectors prestored in the constant memory, producing the transposed output.]

The blocks to be transposed will be stored in memory according to the Memory Mapping part of figure 6.11, where the data is displaced one address higher for each new vector stored.
The displacement is necessary as only one value can be read out from each memory bank. In the Local Vector Memories (LVMs) there are 8 memory banks, one for each column of the memory. This setup enables addressing of the memory according to prestored address vectors in the Constant Memory (CM). As the arrows in the figure display, the first address vector in CM is 0, 9, 18, 27, 4, 13, 22 and 31. This vector fetches the values of pixels 1, 5, 9, 13, 17, 21, 25 and 29, which can then be stored in e.g. a vector register. By using the memory transpose as shown in figure 6.11, the transpose can be performed in only 4 vector copy instructions. An excerpt from the first transpose of the Sleipnir block code is

    vcopy vr0 m1[ar1 + cm[ACCESS_PATTERN_0_4]].vw
    vcopy vr3 m1[ar1 + cm[ACCESS_PATTERN_3_7]].vw
    vcopy vr1 m1[ar1 + cm[ACCESS_PATTERN_1_5]].vw
    vcopy vr2 m1[ar1 + cm[ACCESS_PATTERN_2_6]].vw

where ar1 is an address register pointing to the location of the data stored in memory m1 (Memory Mapping), vr0 to vr3 are vector registers and the access patterns are

    ACCESS_PATTERN_0_4: 0  9 18 27  4 13 22 31
    ACCESS_PATTERN_1_5: 1 10 19 28  5 14 23 32
    ACCESS_PATTERN_2_6: 2 11 20 29  6 15 24 33
    ACCESS_PATTERN_3_7: 3 12 21 30  7 16 25 34

as also shown in figure 6.11. The particular order of the vector registers in the excerpt comes from the minimization of data dependencies in the following stage of the Sleipnir block code. Calculating the transpose of the two 4×4 blocks in only 4 instructions contributes to a fast DCT. Once the DCT is completed, the final stage of the block performs the quantization. The final quantization formula from section 3.5.3 has the benefit of being easy to implement in integer arithmetic, as the division can be replaced by a shift operation and MF only consists of integer numbers. The division by 2^qbits can be rewritten as an arithmetic right shift by qbits.
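Returning to the transpose of figure 6.11, the displaced storage and the permutation addressing can be modelled in Python. The stride-9 placement (8 elements plus 1 skew per row) is inferred from the address vectors above; the function names are illustrative.

```python
# Model of the figure 6.11 transpose: each 8-wide row vector is stored
# 9 addresses after the previous one, so the 8 reads of one access
# pattern all land in different LVM banks (bank = address mod 8).

PATTERNS = {
    "ACCESS_PATTERN_0_4": [0, 9, 18, 27, 4, 13, 22, 31],
    "ACCESS_PATTERN_1_5": [1, 10, 19, 28, 5, 14, 23, 32],
    "ACCESS_PATTERN_2_6": [2, 11, 20, 29, 6, 15, 24, 33],
    "ACCESS_PATTERN_3_7": [3, 12, 21, 30, 7, 16, 25, 34],
}

def store_displaced(rows):
    mem = [None] * (9 * len(rows) - 1)
    for r, row in enumerate(rows):
        for c, value in enumerate(row):
            mem[9 * r + c] = value      # one-address displacement per row
    return mem

def gather(mem, pattern):
    assert len({addr % 8 for addr in pattern}) == 8   # one access per bank
    return [mem[addr] for addr in pattern]
```

Applying the four patterns to the two stored 4×4 blocks of figure 6.11 yields the four transposed vectors; for instance ACCESS_PATTERN_0_4 fetches pixels 1, 5, 9, 13, 17, 21, 25 and 29 exactly as described above, and the assertion confirms that no two reads hit the same bank.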
Utilizing this, the implemented quantization expression can be written as

    Zij = round(Wij · MFij >> qbits)    (6.1)

where >> is the right shift operation. [6] The quantization was implemented by multiplying the 4×4 blocks by the Multiplication Factor (MF) described in equation (3.19) and table 3.2 in section 3.5.3. This results in 4 vector-to-vector multiplications between the blocks and the MF used for the current value of QP. The shift by qbits and the rounding were implemented using the scaling and rounding of the multiplication result, which is a built-in function of the multiplication instruction. An excerpt from the quantization part of the Sleipnir block code is

    vvmul<rnd, scale=16, ss> m0[ar2+=8].vw vr0 cm[MF_QP_10_1]
    vvmul<rnd, scale=16, ss> m0[ar2+=8].vw vr1 cm[MF_QP_10_2]
    vvmul<rnd, scale=16, ss> m0[ar2+=8].vw vr2 cm[MF_QP_10_1]
    vvmul<rnd, scale=16, ss> m0[ar2+=8].vw vr3 cm[MF_QP_10_2]

where the values to be quantized are stored in the vector registers vr0 to vr3, ar2 is the address register pointing to the location in memory m0 where the data will be stored, and the Multiplication Factors (MF) for QP equal to 10 are

    MF_QP_10_1: 8192 5243 8192 5243 8192 5243 8192 5243
    MF_QP_10_2: 5243 3355 5243 3355 5243 3355 5243 3355

which are derived from table 3.2 and equation (3.19) in section 3.5.3. The ability to quantize two 4×4 blocks in 4 instructions gives a quick quantization.

6.2.2 Rescaling and Inverse DCT

The rescaling and the Inverse DCT (IDCT) were also combined into one Sleipnir block to save cycles by performing the IDCT directly after the rescaling. As with the DCT and quantization block, only a fixed rescaling value is supported to speed up execution while still following the H.264 standard. The order of computations in the IDCT and rescaling block is as follows:

1. Perform rescaling by multiplication of the blocks and V'.
2.
Run the blocks through the first IDCT stage.
3. Transpose the blocks.
4. Run the blocks through the second IDCT stage.
5. Divide by 64 and round to get the result.

The calculation of the 4×4 block based two-dimensional IDCT can be described as follows:

1. Calculate x0 .. x3 according to figure 3.10 for each row of the block.
2. Transpose the resulting 4×4 block to be able to calculate the IDCT of the columns.
3. Calculate x0 .. x3 according to figure 3.10 for each column of the block.
4. Transpose the resulting 4×4 block again to get the final result.

As the block is transposed twice, the resulting block is not transposed compared to the input block. To utilize the full datapath width of the Sleipnirs, two 4×4 blocks can be calculated simultaneously. The first stage of the block performs the rescaling by multiplying the 4×4 blocks by the rescaling factors (V) described in equation (3.25) and table 3.3 in section 3.5.4. The final rescaling formula discussed in section 3.5.4 was

    W'ij = Zij · Vij · 2^floor(QP/6)    (6.2)

which, like the final quantization formula, has the benefit of being easy to implement in integer arithmetic. [6] The factor 2^floor(QP/6) causes the output to increase by a factor of two for every increment of 6 in QP. This factor can be incorporated into V, saving at least the calculation of floor(QP/6) and one multiplication, at the cost of having more constants in memory. As the constant memory is only read into the Sleipnir core once for each change of block, this was found to be beneficial, especially if a Sleipnir core will be dedicated to running the IDCT and rescaling block. By incorporating the multiplication by 2^floor(QP/6) into V, (6.2) can be rewritten as

    W'ij = Zij · V'ij    (6.3)

where V'ij is Vij with a built-in scaling of 2 for every increase of 6 in QP.
Note that the result from the following inverse DCT has to be rescaled once more to remove the constant scaling factor of 64 introduced in (3.24), which was also incorporated into V. This is the formula used in the implementation of the rescaling part of the block. An excerpt from the rescaling part of the Sleipnir block code is

    vvmul<rnd, scale=0, ss> vr1 m0[ar0 + 8].vw  cm[V_QP_10_2]
    vvmul<rnd, scale=0, ss> vr3 m0[ar0 + 24].vw cm[V_QP_10_2]
    vvmul<rnd, scale=0, ss> vr0 m0[ar0].vw      cm[V_QP_10_1]
    vvmul<rnd, scale=0, ss> vr2 m0[ar0 + 16].vw cm[V_QP_10_1]

where ar0 is the address register pointing to the location of the blocks in memory m0, the vector registers vr0 to vr3 store the rescaled result and the rescaling factors (V) for QP equal to 10 are

    V_QP_10_1: 32 40 32 40 32 40 32 40
    V_QP_10_2: 40 50 40 50 40 50 40 50

which are derived from table 3.3 and equation (3.25) in section 3.5.4. The ability to rescale the two 4×4 blocks in 4 instructions gives a quick rescaling. The IDCT is implemented much like the DCT described in section 6.2.1. Compared to the DCT, the function of the transform stages is changed to those performing the IDCT as described in section 3.5.2, but for example the transpose functionality is still the same. In addition, the order of the different stages is reversed compared to the DCT. The IDCT is followed by an arithmetic right shift by 6 bits, which can be written as

    X = round(Xr >> 6)    (6.4)

where Xr is the output from the IDCT, >> is the right shift operation and X is the final output. This final shift gives a division by 64 and removes the constant scaling factor of 64 which was introduced through V'ij. The final stage is done by a vector-to-vector multiplication using the built-in scaling and rounding functionality of the multiplication instruction.
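Putting sections 6.2.1 and 6.2.2 together, the whole chain for QP = 10 (forward DCT, quantization, rescaling with V' and inverse DCT followed by the shift by 6) can be sketched in Python. The butterflies are the standard H.264 core transforms and the rounding shift mimics the vvmul <rnd, scale> option; this is an illustrative model, not the Sleipnir code.

```python
# End-to-end model of the transform chain for QP = 10.
QBITS = 16                                        # 15 + floor(10/6)
MF = [[8192, 5243, 8192, 5243], [5243, 3355, 5243, 3355]] * 2
V  = [[32, 40, 32, 40], [40, 50, 40, 50]] * 2     # V' for QP = 10

def rnd_shift(v, bits):                           # round-to-nearest >> bits
    half = 1 << (bits - 1)
    return (v + half) >> bits if v >= 0 else -((-v + half) >> bits)

def fdct1(x):                                     # forward butterfly
    s0, s1, s2, s3 = x[0]+x[3], x[1]+x[2], x[1]-x[2], x[0]-x[3]
    return [s0+s1, 2*s3+s2, s0-s1, s3-2*s2]

def idct1(z):                                     # inverse butterfly
    e0, e1 = z[0]+z[2], z[0]-z[2]
    e2, e3 = (z[1] >> 1) - z[3], z[1] + (z[3] >> 1)
    return [e0+e3, e1+e2, e1-e2, e0-e3]

def two_d(block, f):                              # row pass + transpose, twice
    t = [list(r) for r in zip(*[f(row) for row in block])]
    return [list(r) for r in zip(*[f(row) for row in t])]

def encode_decode(block):
    w  = two_d(block, fdct1)                      # forward DCT
    z  = [[rnd_shift(c*m, QBITS) for c, m in zip(r, mr)]
          for r, mr in zip(w, MF)]                # quantize, eq. (6.1)
    wr = [[c*v for c, v in zip(r, vr)]
          for r, vr in zip(z, V)]                 # rescale, eq. (6.3)
    return [[rnd_shift(c, 6) for c in r]
            for r in two_d(wr, idct1)]            # IDCT + eq. (6.4)
```

For flat residue blocks the chain reconstructs the input exactly; in general the reconstruction error stays small, on the order of the quantization step for QP = 10.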
An excerpt from the final scaling part of the Sleipnir block code is

    vvmul<rnd, scale=6, ss> m1[ar1 + 9].vw  vr0 cm[ONES]
    vvmul<rnd, scale=6, ss> m1[ar1 + 18].vw vr2 cm[ONES]
    vvmul<rnd, scale=6, ss> m1[ar1].vw      vr4 cm[ONES]
    vvmul<rnd, scale=6, ss> m1[ar1 + 27].vw vr5 cm[ONES]

where ar1 is the address register pointing to the location in memory m1 where the results should be stored, the vector registers vr0, vr2, vr4 and vr5 contain the Xr values, and ONES is a constant memory vector consisting of 8 ones:

    ONES: 1 1 1 1 1 1 1 1

as only the scaling and rounding functionality of the multiplier is needed.

Chapter 7 Results and Analysis

In this chapter the performance results from the implementations of the kernels and blocks are presented. The results cover motion estimation, motion compensation, and transform and quantization.

7.1 Motion Estimation

In this section the results from different simulations of motion estimation are presented. The results depend on different properties of the kernel code. Each subsection has a separate description of how the simulation was performed. A total of 5 kernels and 4 video sequences have been tested on 1, 2, 4 and 8 Sleipnir cores. The results are based on calculations of 7 670 macroblocks, which means that the edges of the frame have intentionally been left out. The edges consist of 430 macroblocks. This simplification was done because these macroblocks would add special cases, which could have been solved with for example a message box to each Sleipnir. Message boxes were not available in the revision of the simulator that was used. All test sequences were downloaded from [1]. The simulations were all executed with revision 9888 of the ePUMA simulator with a patch on event.hpp from revision 9958. The patch corrects event IDs for DMA and Sleipnir cores.
In table 7.1 short names of the kernels under test are presented together with a short description. These names are used throughout this section. In table 7.2 the columns of the result tables are described.

    Data_sent = Searches * ((MBs_reference + MBs_data) * vectors_per_MB)    (7.1)

The amount of data sent to the blocks is in all cases calculated according to equation (7.1). Searches is the total number of searches performed and MBs_reference is the number of macroblocks sent to the Sleipnir blocks as reference to be used as the search area. MBs_data is the number of data macroblocks and vectors_per_MB is 16 when using a representation of 8 bits per pixel and 32 when using 16 bits per pixel. Equation (7.1) does not include the data transfer cost of the DMA for programming the Sleipnirs' PM and CM.

    Kernel name  Description
    Kernel 1     Calculates the motion vector for one macroblock each execution.
                 Program flow control is implemented with jump.
    Kernel 2     Calculates the motion vector for one macroblock each execution.
                 Program flow control has support for call and return.
    Kernel 3     Calculates motion vectors for 5 macroblocks each execution.
                 Program flow control has support for call and return.
    Kernel 4     Calculates motion vectors for 13 macroblocks each execution.
                 Program flow control has support for call and return.
    Kernel 5     Calculates motion vectors and motion compensated residue blocks
                 for 13 macroblocks. Program flow control has support for call
                 and return.

    Table 7.1: Short names for kernels that have been tested

    Column name             Description
    Core                    Sleipnir core
    Number of starts        Number of times the Sleipnir core has been started
                            with the specific block
    Total cycles            Total number of cycles that the Sleipnir has executed
                            during simulation
    Idling cycles           Total number of cycles that the Sleipnir has been
                            idling during simulation
    Runtime idle            Number of cycles the Sleipnir has been idling, not
                            including idle before the first start and after the
                            last start
    Utilization in percent  Sleipnir utilization in % based on the total number
                            of cycles executed in the block and the total
                            simulated cycles

    Table 7.2: Description of table columns

7.1.1 Kernel 1

Results presented in this section are simulations of the Sleipnir block performing the motion vector calculation of one macroblock; this block is called block 1. The Sum of Absolute Difference (SAD) calculations are implemented using the complex instructions HVBSUMABSDWA and HVBSUMABSDNA as discussed in section 6.1.2. Program flow control is implemented using the jump instruction. Only the simulations that required the most computational power are presented.

Result

    Core        Number of starts  Total cycles  Idling cycles  Runtime idle  Utilization in percent
    Sleipnir 0  7 670             30 853 670    5 574 019      5 572 399     84.7
    Avg. util.                                                               84.7
    Master                        36 427 689

    Table 7.3: Motion estimation results from simulation on Riverbed frame 10 and Riverbed frame 11 with kernel 1 using 1 Sleipnir core

    Core        Number of starts  Total cycles  Idling cycles  Runtime idle  Utilization in percent
    Sleipnir 0  1 220             4 806 328     1 363 006      1 358 873     77.9
    Sleipnir 1  1 189             4 693 008     1 476 326      1 318 404     76.1
    Sleipnir 2  1 141             4 546 033     1 623 301      1 257 604     73.7
    Sleipnir 3  1 086             4 330 691     1 838 643      1 197 861     70.2
    Sleipnir 4    993             4 072 588     2 096 746      1 104 705     66.0
    Sleipnir 5    885             3 591 793     2 577 541        984 926     58.2
    Sleipnir 6    698             2 902 438     3 266 896        786 886     47.0
    Sleipnir 7    458             1 910 784     4 258 550        509 376     31.0
    Avg. util.                                                               62.5
    Master                        6 169 334

    Table 7.4: Motion estimation results from simulation on Riverbed frame 10 and Riverbed frame 11 with kernel 1 using 8 Sleipnir cores

    Block 1  Cost
    PM       613 instructions
    CM       65 vectors
    LVM 0    26 vectors
    LVM 1    180 vectors

    Table 7.5: Block 1 costs

The best runtime for one block execution was 1 584 cycles and the worst was 10 386 cycles in both simulations. The amount of data that was sent to the blocks was 7 670 * ((9 + 1) * 16) = 1 227 200 vectors (18.73 MByte), calculated according to equation (7.1). 7 670 vectors (0.12 MByte) were copied back to main memory from the blocks. Before any calculations can begin, a prolog is executed to copy vectors to a second memory and to set up address registers. This prolog is 31 cycles in block 1. After the search has finished, an epilog is executed, which takes 8 cycles.

Analysis

Kernel 1 was the first working kernel and a proof of concept. The DMA configurations performed by the master are written in such a way that Sleipnir 0 has the highest priority, Sleipnir 1 the second highest, and so on. This is the reason why Sleipnir 7 has a lower utilization than for example Sleipnir 0. With this kernel it was found that a lot of cycles in the Sleipnir block were spent on state handling and on extra overhead caused by the jump instruction, which can only jump to immediate addresses.

7.1.2 Kernel 2

Results presented in this section are simulations of kernel 2. This kernel uses an improved version of block 1, called block 2. Block 2 calculates the motion vector of one macroblock each execution.
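Equation (7.1) and the megabyte figures quoted in these result sections can be reproduced with a few lines of Python. The assumptions, labelled in the comments, are that one vector is 8 words of 16 bits, i.e. 16 bytes, and that MByte here means 2^20 bytes; the kernel 1 configuration (1 data macroblock and 9 reference macroblocks per search) is used as the example.

```python
# Check of equation (7.1) for the kernel 1 setup.
# Assumption: one vector is 8 x 16-bit words = 16 bytes; MByte = 2**20 bytes.
BYTES_PER_VECTOR = 16
VECTORS_PER_MB = 16          # 8-bit-per-pixel representation

searches = 7670              # macroblocks processed (frame edges excluded)
mbs_reference, mbs_data = 9, 1

vectors_sent = searches * ((mbs_reference + mbs_data) * VECTORS_PER_MB)
mbytes_sent = vectors_sent * BYTES_PER_VECTOR / 2**20

vectors_back = 7670          # one result vector per macroblock
mbytes_back = vectors_back * BYTES_PER_VECTOR / 2**20
```

This reproduces the 1 227 200 vectors (18.73 MByte) sent to the blocks and the 0.12 MByte copied back that are reported for kernels 1 and 2.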
The Sum of Absolute Difference (SAD) calculations are implemented using the complex instructions HVBSUMABSDWA and HVBSUMABSDNA as discussed in section 6.1.2. In block 2, hardware support for call and return has been added to the simulator, and this is utilized for program flow control.

Result

    Core        Number of starts  Total cycles  Idling cycles  Runtime idle  Utilization in percent
    Sleipnir 0  7 670             16 933 783    5 572 953      5 571 478     75.2
    Avg. util.                                                               75.2
    Master                        22 506 736

    Table 7.6: Motion estimation results from simulation on Riverbed frame 10 and Riverbed frame 11 with kernel 2 using 1 Sleipnir core

The results from simulation with the riverbed video sequence are presented in tables 7.6 and 7.7. The best runtime for one block 2 execution was 986 cycles and the worst was 5 348 cycles in both simulations. The amount of data that was sent to the blocks was 7 670 * ((9 + 1) * 16) = 1 227 200 vectors (18.73 MByte) and 7 670 vectors (0.12 MByte) were copied back as the result from the blocks. Before any calculations can begin, a prolog is executed to copy vectors to a second memory and to set up address registers. This prolog is the same as in block 1 and therefore takes 31 cycles. After the search has finished, an epilog is executed, which, as in block 1, takes 8 cycles.

    Core        Number of starts  Total cycles  Idling cycles  Runtime idle  Utilization in percent
    Sleipnir 0  1 755             3 848 831     2 003 542      2 000 872     65.8
    Sleipnir 1  1 678             3 652 381     2 199 992      1 909 706     62.4
    Sleipnir 2  1 562             3 393 931     2 458 442      1 770 069     58.0
    Sleipnir 3  1 358             3 004 121     2 848 252      1 542 502     51.3
    Sleipnir 4    910             2 069 282     3 783 091      1 035 091     35.4
    Sleipnir 5    368               867 747     4 984 626        417 652     14.8
    Sleipnir 6     38                95 305     5 757 068         41 958      1.6
    Sleipnir 7      1                 2 178     5 850 195          1 181      0.0
    Avg. util.                                                               36.2
    Master                        5 852 373

    Table 7.7: Motion estimation results from simulation on Riverbed frame 10 and Riverbed frame 11 with kernel 2 using 8 Sleipnir cores

    Block 2  Cost
    PM       442 instructions
    CM       65 vectors
    LVM 0    26 vectors
    LVM 1    180 vectors

    Table 7.8: Block 2 costs

Analysis

In block 2 an improvement of 37.8% in best execution time can be seen compared to block 1. There is also an improvement of 48.5% in the worst execution time compared to block 1. This improvement is significant and should lower the total execution time of one frame, but as can be seen the total execution time is only improved by 5.1%. The explanation is that the average utilization of the Sleipnir cores has decreased from 84.7% to 75.2% in the simulation with 1 Sleipnir and from 62.5% to 36.2% in the simulation with 8 Sleipnirs. In table 7.7 it can be seen that the utilization of Sleipnir 7 is 0.0%. This indicates that the blocks are executing too few cycles in the Sleipnirs or that the master is too slow and does not feed the Sleipnirs with enough data. Targeting the master code does not offer many opportunities for optimization, and the complexity of the code has not yet reached the complexity of a complete encoder. It was therefore concluded that searching more macroblocks per block execution should be investigated. Table 7.8 shows the memory cost of block 2.

7.1.3 Kernel 3

Results presented in this section are simulations of kernel 3. This kernel uses a Sleipnir block called block 3. Block 3 is a further development of block 2 where a wrapper that handles looping has been added.
Block 3 calculates the motion vectors of 5 macroblocks during each execution. As in kernel 2, the Sum of Absolute Difference (SAD) calculations are implemented using the complex instructions HVBSUMABSDWA and HVBSUMABSDNA as discussed in section 6.1.2.

Result

    Core        Number of starts  Total cycles  Idling cycles  Runtime idle  Utilization in percent
    Sleipnir 0  198               2 226 674     273 956        264 792       89.0
    Sleipnir 1  195               2 235 280     265 350        264 586       89.4
    Sleipnir 2  193               2 218 086     282 544        256 840       88.7
    Sleipnir 3  194               2 187 261     313 369        255 665       87.5
    Sleipnir 4  191               2 152 389     348 241        253 583       86.1
    Sleipnir 5  190               2 175 113     325 517        252 341       87.0
    Sleipnir 6  187               2 135 931     364 699        250 903       85.4
    Sleipnir 7  186               2 139 942     360 688        247 395       85.6
    Avg. util.                                                               87.3
    Master                        2 500 630

    Table 7.9: Motion estimation results from simulation on Riverbed frame 10 and Riverbed frame 11 with kernel 3 using 8 Sleipnir cores

    Block 3  Cost
    PM       478 instructions
    CM       67 vectors
    LVM 0    26 vectors
    LVM 1    444 vectors

    Table 7.10: Kernel 3 costs

The results from simulation with the riverbed video sequence are presented in table 7.9. The best runtime for one Sleipnir block execution was 5 866 cycles and the worst was 19 416 cycles. The amount of data that was sent to the blocks was (7 670/5) * ((3 * 7 + 5) * 16) = 638 144 vectors (9.74 MByte) and 7 670 vectors (0.12 MByte) were copied back as the result from the blocks. The prolog in block 3 is slightly larger than in block 2; it is now 46 cycles. The epilog has also increased and now takes 83 cycles. Between the calculations on each macroblock there is an intermission that takes 43 cycles to finish. This intermission changes the offsets for memory reads and copies a new macroblock to the second memory.

Analysis

Kernel 3 resulted in a 57.3% improvement of total simulation time for execution on 8 Sleipnirs compared to kernel 2. The utilization has increased to over 85% in Sleipnir 7, which is more acceptable.
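The utilization column in these tables follows directly from the cycle counts: the Sleipnir's executed cycles divided by the total simulated cycles, where the latter equals executed plus idling cycles and matches the master's cycle count. A small Python check against the Sleipnir 0 row of table 7.9:

```python
def utilization(total_cycles, idling_cycles):
    # Sleipnir utilization in percent: executed cycles over simulated cycles
    return 100.0 * total_cycles / (total_cycles + idling_cycles)

# Sleipnir 0 in table 7.9: 2 226 674 executed, 273 956 idling;
# their sum equals the master's 2 500 630 simulated cycles
u0 = utilization(2_226_674, 273_956)
```

Rounded to one decimal this gives the 89.0% reported for Sleipnir 0.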
The wrapper introduced in block 3 only required 36 extra instructions compared to block 2. The increase in LVM memory needed is for storage of the 16 extra macroblocks, 4 more motion vectors and extra overhead from e.g. the added loop counter.

7.1.4 Kernel 4

Results presented in this section are simulations of kernel 4. This kernel uses a Sleipnir block called block 4. Block 4 is the next step of improvement of the Sleipnir blocks, and it calculates 13 motion vectors during each execution. As in blocks 2 and 3, the Sum of Absolute Difference (SAD) calculations are implemented using the complex instructions HVBSUMABSDWA and HVBSUMABSDNA as discussed in section 6.1.2.

Result

    Core        Number of starts  Total cycles  Idling cycles  Runtime idle  Utilization in percent
    Sleipnir 0  149               4 370 096     262 074        253 825       94.3
    Sleipnir 1  146               4 359 195     272 975        250 107       94.1
    Sleipnir 2  148               4 377 214     254 956        249 644       94.5
    Sleipnir 3  147               4 345 295     286 875        250 289       93.8
    Avg. util.                                                               94.2
    Master                        4 632 170

    Table 7.11: Motion estimation results from simulation with Riverbed frame 10 and Riverbed frame 11 with kernel 4 using 4 Sleipnir cores

The results from simulation with the riverbed video sequence are presented in tables 7.12 and 7.11. The best runtime for one Sleipnir block execution was 18 057 cycles and the worst was 42 896 cycles in both simulations. The amount of data that was sent to the blocks was (7 670/13) * ((15 * 3 + 13) * 16) = 547 520 vectors (8.35 MByte) and 7 670 vectors (0.12 MByte) were copied back as the result from the blocks. Block 4 has the same prolog, intermission and epilog cycle costs as block 3, i.e. 46, 43 and 83 cycles.

    Core        Number of starts  Total cycles  Idling cycles  Runtime idle  Utilization in percent
    Sleipnir 0  75                2 214 930     146 170        130 223       93.8
    Sleipnir 1  74                2 200 402     160 698        129 794       93.2
    Sleipnir 2  74                2 194 769     166 331        129 572       93.0
    Sleipnir 3  75                2 191 467     169 633        130 113       92.8
    Sleipnir 4  75                2 191 754     169 346        132 144       92.8
    Sleipnir 5  73                2 174 953     186 147        127 130       92.1
    Sleipnir 6  72                2 127 822     233 278        125 152       90.1
    Sleipnir 7  72                2 155 699     205 401        127 116       91.3
    Avg. util.                                                               92.4
    Master                        2 361 100

    Table 7.12: Motion estimation results from simulation on Riverbed frame 10 and Riverbed frame 11 with kernel 4 using 8 Sleipnir cores

    Block 4  Cost
    PM       478 instructions
    CM       67 vectors
    LVM 0    26 vectors
    LVM 1    964 vectors

    Table 7.13: Kernel 4 costs

Analysis

Kernel 4 pushes the utilization up to over 90% in every Sleipnir. The total simulation time has decreased from 2.50 Mega cycles (Mc) to 2.36 Mc, which is an improvement of 5.6%. Kernel 4 only copies 85.8% of the data compared to kernel 3. This decrease in memory data transfers will help later when the whole encoder is implemented. The cost of local memory used in the Sleipnir block has increased by 520 vectors compared to block 3.

7.1.5 Kernel 5

Results presented in this section are simulations of kernel 5. This kernel uses a Sleipnir block called block 5. Block 5 uses the same motion estimation code as block 4, where, as before, the Sum of Absolute Difference (SAD) calculations are implemented using the complex instructions HVBSUMABSDWA and HVBSUMABSDNA as discussed in section 6.1.2. Added to block 5 is code for calculating the motion compensated residue macroblock, which is done using the HVBSUBWA and HVBSUBNA instructions as discussed in section 6.1.2. The benefit of doing this in the same Sleipnir block is that all extra overhead for moving data to another kernel is avoided. In this part, simulation results from 4 different video sequences are presented to highlight that there is a difference in total simulation time depending on the data that is fed to the Sleipnir blocks.
Result

    Core        Number of starts  Total cycles  Idling cycles  Runtime idle  Utilization in percent
    Sleipnir 0  74                1 783 865     198 721        175 131       90.0
    Sleipnir 1  75                1 806 053     176 533        178 070       91.1
    Sleipnir 2  74                1 799 925     182 661        176 137       90.8
    Sleipnir 3  73                1 766 940     215 646        173 459       89.1
    Sleipnir 4  74                1 787 586     195 000        178 102       90.2
    Sleipnir 5  74                1 798 046     184 540        178 102       90.7
    Sleipnir 6  73                1 776 852     205 734        176 123       89.6
    Sleipnir 7  73                1 765 463     217 123        173 901       89.0
    Avg. util.                                                               90.1
    Master                        1 982 586

    Table 7.14: Motion estimation results from simulation on Sunflower frame 10 and Sunflower frame 11 with kernel 5 using 8 Sleipnir cores

The results from simulation with the sunflower video sequence are presented in table 7.14. The best runtime for one Sleipnir block execution was 18 457 cycles and the worst was 28 039 cycles. In table 7.15 a simulation on the blue sky video sequence is presented, which resulted in a best runtime for one Sleipnir block execution of 18 079 cycles and a worst runtime of 41 415 cycles.

    Core        Number of starts  Total cycles  Idling cycles  Runtime idle  Utilization in percent
    Sleipnir 0  74                1 945 498     195 944        170 596       90.8
    Sleipnir 1  75                1 954 782     186 660        167 994       91.3
    Sleipnir 2  74                1 929 502     211 940        170 267       90.1
    Sleipnir 3  73                1 919 413     222 029        165 579       89.6
    Sleipnir 4  73                1 926 889     214 553        164 997       90.0
    Sleipnir 5  75                1 926 105     215 337        171 946       89.9
    Sleipnir 6  74                1 926 972     214 470        172 619       90.0
    Sleipnir 7  72                1 895 777     245 665        168 488       88.5
    Avg. util.                                                               90.0
    Master                        2 141 442

    Table 7.15: Motion estimation results from simulation on Blue sky frame 10 and Blue sky frame 11 with kernel 5 using 8 Sleipnir cores

The third simulation was done with the pedestrian area clip and the results can be found in table 7.16. The best runtime for one Sleipnir block execution was 15 378 cycles and the worst was 47 611 cycles.

    Core        Number of starts  Total cycles  Idling cycles  Runtime idle  Utilization in percent
    Sleipnir 0  79                2 089 641     193 948        184 881       91.5
    Sleipnir 1  75                2 071 619     211 970        171 855       90.7
    Sleipnir 2  73                2 071 989     211 600        168 548       90.7
    Sleipnir 3  75                2 049 072     234 517        177 366       89.7
    Sleipnir 4  71                2 080 353     203 236        164 579       91.1
    Sleipnir 5  73                2 058 428     225 161        171 021       90.1
    Sleipnir 6  73                2 043 657     239 932        171 602       89.5
    Sleipnir 7  71                2 031 873     251 716        164 102       89.0
    Avg. util.                                                               90.3
    Master                        2 283 589

    Table 7.16: Motion estimation results from simulation on Pedestrian area frame 10 and Pedestrian area frame 11 with kernel 5 using 8 Sleipnir cores

    Core        Number of starts  Total cycles  Idling cycles  Runtime idle  Utilization in percent
    Sleipnir 0  147               4 659 746     315 540        311 485       93.7
    Sleipnir 1  147               4 628 572     346 714        310 410       93.0
    Sleipnir 2  148               4 616 250     359 036        313 279       92.8
    Sleipnir 3  148               4 629 588     345 698        315 330       93.1
    Avg. util.                                                               93.1
    Master                        4 975 286

    Table 7.17: Motion estimation results from simulation on Riverbed frame 10 and Riverbed frame 11 with kernel 5 using 4 Sleipnir cores

    Core        Number of starts  Total cycles  Idling cycles  Runtime idle  Utilization in percent
    Sleipnir 0  75                2 333 516     207 336        188 358       91.8
    Sleipnir 1  74                2 349 169     191 683        187 756       92.5
    Sleipnir 2  73                2 331 312     209 540        185 101       91.8
    Sleipnir 3  75                2 323 308     217 544        189 070       91.4
    Sleipnir 4  74                2 331 564     209 288        188 828       91.8
    Sleipnir 5  75                2 305 196     235 656        186 202       90.7
    Sleipnir 6  72                2 299 853     240 999        184 476       90.5
    Sleipnir 7  72                2 260 234     280 618        180 283       89.0
    Avg. util.                                                               91.2
    Master                        2 540 852

    Table 7.18: Motion estimation results from simulation on Riverbed frame 10 and Riverbed frame 11 with kernel 5 on 8 Sleipnir cores

The last simulation was done on the riverbed video sequence and the results can be found in table 7.18. The best runtime for one Sleipnir block execution was 19 884 cycles and the worst was 44 739 cycles in the simulations presented in tables 7.18 and 7.17.
The amount of data that was sent to the Sleipnir blocks was (7 670/13) * ((15 * 3 + 13) * 16) = 547 520 vectors (8.35 MByte) and 7 670 * 33 = 253 110 vectors (1.99 MByte) were copied back from the blocks. The prolog cost for block 5 is 46 cycles and the intermission cycle cost is 43, the same as in block 4. The epilog in block 5 is 185 cycles and comes from the time it takes to save the motion compensated residue to local vector memory.

    Block 5       Cost
    PM            574 instructions
    CM            64 vectors
    LVM 0         26 vectors
    LVM 1         1 411 vectors
    copy LVM→VR   34 instructions
    copy CM→VR    23 instructions

    Table 7.19: Kernel 5 costs

Analysis

The difference from block 4 can easily be seen in table 7.19, where the memory cost has increased considerably due to the extra vectors needed for storing the motion compensated residues. This also requires extra data to be copied back to main memory, which increases the runtime idle. The differences can be seen by comparing table 7.12 and table 7.18. The cycle cost of kernel 5 is not based on complete full HD frames, as was mentioned in the beginning of the chapter. Equation (7.2) is a calculated approximation of the increased cost when the input data is a complete full HD frame.

    Number of MB = 8 100
    Number of MB in kernel 5 = 7 670
    P_inc = 8 100 / 7 670 = 1.06
    Total cycle cost = 2 540 852 * P_inc = 2 693 304    (7.2)

The numbers of copy instructions from one of the LVMs and from the CM to the Vector Register (VR) are listed in table 7.19. These copies do not add any computational functionality and are therefore not desirable. For block 5 these numbers are rather low, which indicates that not too much unnecessary copying is done. Some of these copies are used to speed up the block; for example, by pre-loading a value to the vector register instead of reading it from the CM, the execution of the instruction using it will finish faster.
7.1.6 Master Code

The master code is used when testing the motion estimation blocks, also known as block 1 through block 5.

Program Memory Costs

The master codes used in kernels 1, 2, 3, 4 and 5 are slightly different but have the same size. The simulations done with 1, 2, 4 and 8 Sleipnirs differ somewhat in code size, since code that is only used for keeping track of more Sleipnirs can be removed. The DMA firmware that was used is not included in the statistics for the master; instead it has a row of its own, since it is included in all the kernels.

Description                 Code size   RAM   ROM
Master with 1 Sleipnir            326    58    16
Master with 2 Sleipnirs           363    60    16
Master with 4 Sleipnirs           437    64    16
Master with 8 Sleipnirs           585    72    16
DMA Firmware                      272     0     0

Table 7.20: Master code cost

In table 7.20 the column Code size is measured in number of instructions and the columns RAM and ROM are measured in words. Table 7.20 shows the DMA firmware as not using any memory. That is not really the case; its memory cost has been included in the master code costs so that it is not counted twice. Worth mentioning is that the DMA firmware was not written by the authors and therefore has its own row for the cost. The ROM contains information for the DMA pointing to the addresses where the program memory and constant memory contents for the Sleipnirs are stored. The instruction costs of the master have not been a target for optimization and therefore leave room for improvement. The reason for not optimizing the code is that it will not be used in exactly this way in a complete encoder. In main memory, data for 2 complete full HD frames with 4:2:0 sampling is allocated, as well as space for the contents of the Sleipnirs' program memory and constant memory. The master also allocates memory for the resulting motion vectors and motion compensated residue blocks.
Prolog and Epilog

Before any calculations can be done, the environment has to be configured in the processor. Table 7.21 presents this cost. The cycle count of the long prolog includes configuration of the stack pointer, interrupt handling, setup of registers, programming of the Sleipnir cores' PM and CM, and data copying to LVM. It can also be described as the cycle count until the first Sleipnir core is started. The short prolog is the same as the long prolog except that the data copying to LVM has been excluded.

Task           Cycles
Prolog short      929
Prolog long    38 262
Epilog            277

Table 7.21: Prolog and epilog cycle costs

When all calculations have been performed, an epilog is initiated. The epilog finalizes the kernel; in this case it empties the last Sleipnir core of calculation results. The cycle count of waiting for the last Sleipnir to finish is not included, because it depends on which data the calculations are performed upon.

Video Sequence      Epilog cycles
Blue Sky                   25 955
Sunflower                  23 169
Pedestrian Area            22 350
Riverbed                   31 969

Table 7.22: Simulated epilog cycle cost including waiting for the last Sleipnir to finish

The epilog cycle costs including waiting for the last Sleipnir to finish have been measured and the results are presented in table 7.22. Results for other frames of the 4 video sequences can of course be worse than in table 7.22, but the table gives a better understanding of the magnitude of the cycle cost. All results are from simulations with 8 Sleipnir cores. As can be seen in table 7.22, the riverbed simulation had the longest epilog execution time. This can vary and is not necessarily related to the overall computational load of the frame. The last part to be motion estimated is the down-right corner of the frame. One of these columns of 13 macroblocks will likely be the last column a Sleipnir has to process in the end, and that is the Sleipnir the master has to wait for.
If there is a lot of motion in the down-right corner of the frame, the calculation of the motion vectors will need more cycles to finish and the epilog will therefore cost more cycles.

DMA

To initiate a DMA transfer, the DMA module needs to be configured and the transfer needs to be started. The DMA firmware provides subroutines to do this. Table 7.23 lists DMA costs from kernel 5. In table 7.23 the transfer cost for search data is 760 cycles. The observant reader notices that kernel 5 should only need 720 cycles to transfer 720 vectors. The measurement is done in such a way that all extra penalties are included, which means that the cycle costs for interrupt and return are included. The transfer time is therefore longer than expected.

Task                                   Cycles
Loading Sleipnir PM
  Configure                                41
  Start                                    39
  Transfer block 5                        666
Loading Sleipnir CM
  Configure                                41
  Start                                    44
  Transfer block 5                        106
DMA Firmware
  Configure search data                    75
  Configure results                        59
  Start search data                        62
  Start results                            62
  Transfer search data, block 5           760
  Transfer MB to search for, block 5      250
  Transfer results, block 5                55

Table 7.23: DMA cycle costs

The costs for copying the Sleipnir PM and CM are also presented in table 7.23. These costs are included in the prolog of the program; the total cost of the prolog can be found in table 7.21. Considering this when implementing a complete encoder, decisions can be taken whether to distribute the different blocks between different cores or to load the cores with a new block between tasks. Table 7.23 also shows three different data transfers. At least two are needed: one for filling a LVM with data and one for emptying the LVM. The reason for using three different transfers was that an easier memory allocation scheme in main memory could be used. To gain a better utilization of the Sleipnir cores, the master needs to start them and keep them running as much as possible.
One way to increase utilization is to do as much as possible during DMA transactions. The master that was used for simulating kernels 1 to 5 did not offer much opportunity to hide cycles during DMA transfers. During the transfer of search data 98 cycles could be executed, and during the transfer of the macroblocks to search for 22 cycles could be executed. When the results are copied back to main memory, 0 cycles could be saved.

7.1.7 Summary

The results from kernel 2 can be compared with the cycle cost of the H.264 encoder for the STI Cell processor, which can be found in [15] and [11]. There the cycle cost of performing a hexagon search on a macroblock of 16 × 16 pixels in a (-15,15)x(-15,15) search area is listed. This corresponds to the same functionality as was implemented in Sleipnir blocks 1 through 4. The listed cycle costs for the best and worst case searches are 1 451 and 3 609 cycles respectively. These results can be compared to the best and worst runtimes for kernel 2 running motion estimation on a macroblock of 16 × 16 pixels in a (-15,15)x(-15,15) search area for the riverbed video sequence. Kernel 2 is used since each search can be measured separately while the functionality is still the same. The best and worst runtimes were 986 and 5 348 cycles. This shows that the best case runtime is substantially shorter for the ePUMA implementation, while the worst case is substantially better for the STI Cell implementation. Block 2 still offers room for improvement. The low runtime of the best case indicates that the search and its overhead could be optimized further for long searches to reach better worst case performance.
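The hexagon search compared above can be sketched as follows. This is an illustrative Python model of the algorithm in [16], not the Sleipnir assembly: the SAD cost function, the frame representation and the assumption that the block lies far enough from the frame borders are all simplifications.

```python
def sad(cur, ref, cx, cy, rx, ry, n=16):
    """Sum of absolute differences between the n x n block of the current
    frame at (cx, cy) and the reference frame block at (rx, ry)."""
    return sum(abs(cur[cy + j][cx + i] - ref[ry + j][rx + i])
               for j in range(n) for i in range(n))

# Large hexagon pattern (6 points) and small refinement pattern (4 points).
HEX = [(-2, 0), (-1, -2), (1, -2), (2, 0), (1, 2), (-1, 2)]
SQUARE = [(-1, 0), (1, 0), (0, -1), (0, 1)]

def hexagon_search(cur, ref, cx, cy, search_range=15):
    """Hexagon-based block motion search; returns (motion vector, SAD)."""
    best, best_cost = (0, 0), sad(cur, ref, cx, cy, cx, cy)
    # Coarse stage: move the large hexagon until no neighbour improves.
    improved = True
    while improved:
        improved = False
        for dx, dy in HEX:
            mx, my = best[0] + dx, best[1] + dy
            if abs(mx) > search_range or abs(my) > search_range:
                continue
            cost = sad(cur, ref, cx, cy, cx + mx, cy + my)
            if cost < best_cost:
                best, best_cost, improved = (mx, my), cost, True
    # Refinement stage: one pass with the small pattern.
    for dx, dy in SQUARE:
        mx, my = best[0] + dx, best[1] + dy
        cost = sad(cur, ref, cx, cy, cx + mx, cy + my)
        if cost < best_cost:
            best, best_cost = (mx, my), cost
    return best, best_cost
```

On a synthetic frame pair where the content is shifted two pixels horizontally, the search converges to the motion vector (2, 0) with a SAD of zero.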
Scalability

[Figure 7.1 is a line graph of total cycle count (0 to 40 000 000) versus number of Sleipnir cores (1 to 8) for kernels 1 to 5.]

Figure 7.1: Cycle scaling from 1 to 8 Sleipnir cores for simulation of riverbed

Figure 7.1 shows a graph of the scaling from 1 core to 8 cores for the 5 different kernels. It can be seen that the scaling is almost linear in the simulations with kernels 4 and 5. This shows that the master can fully utilize the extra cores to speed up calculations. The simulation results that the graph is created from can be found in appendix B. In the same table, simulation results from pedestrian area, sunflower and blue sky can be found. The reason for the better scaling in kernels 4 and 5 is that more calculations, and therefore more cycles, are performed during each execution of a Sleipnir core. By spending more cycles in the kernel, the master has more time to provide the other 7 cores with data between two executions of a Sleipnir core. The processor is utilized best when all Sleipnir cores are running simultaneously. The easiest way to increase speed further would be either to optimize kernel 5 even more or to write code for a master that utilizes the third local vector memory that is connected to the DMA. Utilizing this LVM will make it possible to hide more DMA cycles and thereby increase the utilization of the Sleipnir cores, which will result in a faster total execution time.

Energy Reduction Results

Figure 7.2: Frame 10 from Pedestrian Area video sequence

Figure 7.3: Difference between frame 10 and frame 11 in Pedestrian Area video sequence

To see the real difference between ordinary residue calculation and motion compensated residue calculation, both will be presented as well as the differences between them.
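The comparison relies on the Mean Square Error and Peak Signal to Noise Ratio measures of equations (2.4) and (2.5) in section 2.6. A minimal Python sketch of the two measures, assuming 8-bit samples with peak value 255:

```python
import math

def mse(a, b):
    """Mean square error between two equally sized grayscale frames."""
    n = len(a) * len(a[0])
    return sum((pa - pb) ** 2
               for ra, rb in zip(a, b) for pa, pb in zip(ra, rb)) / n

def psnr(mse_value, peak=255):
    """Peak signal to noise ratio in dB for samples with the given peak."""
    return 10 * math.log10(peak ** 2 / mse_value)

# The MSE values reported for figure 7.3 and figure 7.5:
print(round(psnr(313), 2))  # → 23.18 (the text reports 23.16, from the unrounded MSE)
print(round(psnr(51), 2))   # → 31.06
```

Fed with the reported MSE values, this reproduces the PSNR figures quoted below to within rounding.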
Figure 7.2 is frame 10 from the pedestrian area video sequence and shows the back of a person in the center of the image and a lot of moving people on the street in the background. Figure 7.3 presents the residue between frame 10 and frame 11. White areas in the picture indicate big differences between the two frames, while darker areas indicate a better match between the two frames.

Figure 7.4: Motion vector field calculated by kernel 5 on frame 10 and 11 of the Pedestrian Area video sequence

Figure 7.5: Difference between frame 10 and frame 11 in Pedestrian Area video sequence using motion compensation

In figure 7.4 the calculated motion vectors are shown. The motion vectors illustrate how the macroblocks have been estimated to move. Areas that move very little or not at all have short or no motion vectors, resulting in the area being very bright, while areas with longer motion vectors show up as darker areas. The effect of motion compensation can be seen in figure 7.5. More dark areas and fewer bright areas are visible, which implies that the residue is smaller and will need less space after compression. The improvements can not only be seen but can also be shown by the numbers. By using equation (2.4) and equation (2.5), introduced in section 2.6, the Peak Signal to Noise Ratio (PSNR) and Mean Square Error (MSE) can be calculated. The residue in figure 7.3 has a MSE of 313 and a PSNR of 23.16 dB. The motion compensated residue, illustrated in figure 7.5, has a MSE of 51 and a PSNR of 31.06 dB, which is a significant improvement.

7.2 Transform and Quantization

This section presents the results from the Sleipnir blocks DCT with quantization and IDCT with rescaling.
Results

Two 4x4 DCT + quantization     Cost
PM                             54 instructions
CM                             12 vectors
LVM 0                          8 vectors
LVM 1                          5 vectors
copy LVM→VR                    4 instructions
copy CM→VR                     0 instructions

Two 4x4 IDCT + rescale         Cost
PM                             51 instructions
CM                             12 vectors
LVM 0                          8 vectors
LVM 1                          5 vectors
copy LVM→VR                    0 instructions
copy CM→VR                     0 instructions

Table 7.24: Costs for DCT with quantization block and IDCT with rescaling block

The results in table 7.24 are for a fixed value of QP equal to 10. Using a fixed value of QP gives a fast execution time and low program and constant memory costs. If a variable QP were desired, this would need some extra program memory instructions and at least three extra vectors in the constant memory for every additional QP. The cycle cost of running two 4x4 pixel blocks through the DCT with quantization block is 72 cycles. The cycle cost of running two 4x4 pixel blocks through the IDCT with rescaling block is 69 cycles. This means that, if fed with enough data, the DCT with quantization block could transform and quantize one full HD frame in 72 ∗ (16/2) ∗ 8 100 = 4 665 600 cycles, calculated according to equation (7.3).

total_cycles = cycles ∗ (4x4_blocks_per_MB / blocks_calculated_per_execution) ∗ MBs_per_frame        (7.3)

The same calculation for the IDCT with rescaling block gives 69 ∗ (16/2) ∗ 8 100 = 4 471 200 cycles. These numbers constitute a lower limit of the execution cycles needed using the current blocks, but still give an approximation of the cycle costs. A cost not far from these could be achieved if each block had a dedicated Sleipnir core that is kept well fed with input data. These results can be compared with the cycle costs on the STI Cell processor, which can be found in [15] and [11]. There the cycle costs of performing DCT, IDCT, quantization and rescaling are presented. The cost listed to perform DCT on 4 blocks of 4 × 4 pixels is 100 cycles and the cost to perform quantization of 2 blocks of 4 × 4 pixels is 96 cycles.
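Equation (7.3) above can be checked with a short script; the parameters are those of the text (16 luma 4x4 blocks per macroblock, 8 100 macroblocks per frame, 2 blocks per Sleipnir execution):

```python
# Lower-bound per-frame cycle cost of a transform block, equation (7.3).
def frame_cycles(cycles_per_execution, blocks_per_execution=2,
                 blocks_per_mb=16, mbs_per_frame=8100):
    executions_per_mb = blocks_per_mb // blocks_per_execution
    return cycles_per_execution * executions_per_mb * mbs_per_frame

print(frame_cycles(72))  # DCT with quantization  → 4665600
print(frame_cycles(69))  # IDCT with rescaling    → 4471200
```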
The total cost of running four blocks through DCT and quantization can therefore be summed up as DCT + Quant ∗ 2 = 100 + (96 ∗ 2) = 292 cycles. In the ePUMA case, the cycle cost of performing DCT and quantization of 4 blocks of 4 × 4 pixels equals the cost of running the DCT with quantization block twice. This gives 2 ∗ 72 = 144 cycles, which is only 49.3% of the cycle cost listed for the STI Cell implementation. The listed cost to perform IDCT and rescaling for 4 blocks on the STI Cell is 2 ∗ Dequant + IDCT = (2 ∗ 88) + 96 = 272 cycles for a QP value less than 24. The cost for performing the same operations on the ePUMA equals the cost of running the IDCT with rescaling block twice, which sums up to 2 ∗ 69 = 138 cycles, only 50.7% of the cycle cost listed for the STI Cell.

Analysis

In table 7.24 it can be seen that the DCT and IDCT Sleipnir blocks are fairly small. It would be possible to make them even smaller and faster by adding more complex instructions. ePUMA has hardware support for performing a radix-4 butterfly in one instruction. Modifying this instruction for the purpose of computing the H.264 integer DCT would increase the computation speed of the DCT and IDCT. As seen in both kernels 1 and 2 for motion estimation, the utilization became low when using multiple Sleipnir cores, because the master could not provide data at a satisfactory rate. The solution was to add support for calculating motion vectors for more macroblocks in a single execution. This technique would of course be beneficial in the DCT and IDCT blocks too. If even more performance is needed, an increase in the number of vector registers could be a solution. This would make it possible to replace some nop instructions with calculating instructions, which would increase the throughput. This solution would be possible even without the complex instruction proposed above.
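For reference, the transform that these blocks implement, the 4x4 forward core transform of H.264 described in [6], can be modelled as a pair of integer matrix products. This is a behavioural sketch of the standard transform, not the Sleipnir vector code (which computes it with butterfly-style instructions and permutations):

```python
# H.264 4x4 forward core transform: Y = Cf * X * Cf^T, integers only.
CF = [[1,  1,  1,  1],
      [2,  1, -1, -2],
      [1, -1, -1,  1],
      [1, -2,  2, -1]]

def matmul4(a, b):
    """4x4 integer matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def forward_core_transform(x):
    """Apply the core transform to a 4x4 residual block x."""
    cf_t = [list(row) for row in zip(*CF)]   # transpose of CF
    return matmul4(matmul4(CF, x), cf_t)
```

As a sanity check, a constant block of value c transforms to a single DC coefficient of 16c with all other coefficients zero.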
The numbers of copy instructions from one of the LVMs and from the CM to the Vector Register (VR) are listed in table 7.24. These copies do not add any computational functionality and are therefore not desirable. For the DCT with quantization and IDCT with rescaling blocks these numbers are very low, which indicates that very little unnecessary copying is done.

Chapter 8

Discussion

This chapter presents a discussion about the ePUMA architecture, improvements of the hardware, considerations when programming the processor and how things were done.

8.1 DMA

H.264 is a very memory demanding compression algorithm, and that is one of the reasons why it is a challenge to build a real-time encoder. The master that was implemented used a pre-developed DMA firmware which provided functions to configure and start all different kinds of DMA tasks. This includes both one- and two-dimensional memory transfers, broadcasts and Sleipnir to Sleipnir transfers. This can make the firmware a target for optimization in an encoder. The time it takes to configure and start a new DMA task could in some cases become a problem when a small chunk of data is to be copied. A message box for this kind of small communication is therefore useful. At the point the thesis was started this message system was not available in the simulator and has therefore not been used.

8.2 Main Memory

The simulator version that was used when implementing the kernels and Sleipnir blocks did not take the latency of the off-chip main memory into account. The latency was set to the same access time as the local storage, which means that the results in reality would likely be worse. There are techniques to reduce this off-chip latency, but these delays were not considered in the results as the simulator does not support simulating them.

8.3 Program Memory

For block 5 the program memory size became a problem.
The problem is not critical because the size of the program memory has not yet been decided. The aim of the thesis was to test the ePUMA architecture's capabilities in terms of real-time encoding, and therefore speed was more important. Shrinking the block could have an impact on the block's performance, but that is not necessarily the case. With techniques such as register forwarding, instructions could be removed and the block would shrink.

8.4 Constant Memory

The constant memory can only be addressed a complete vector at a time. This can cause problems when a program uses a lot of constants, since each constant occupies an entire vector of 128 bits even though only one word is needed. There are two solutions to this problem. The first is to change the addressing of the CM so that words can be addressed separately, just like in the LVMs. The second is to make it possible for some instructions to accept an immediate operand.

8.5 Vector Register File

In chapter 7 it was mentioned that better performance of, for example, the DCT could be achieved if the vector register file were increased. The register file is of course a very expensive memory considering it must be a multi-port memory, and increasing its size will increase the total size of the core. The penalty of the small register file is paid in program memory size in the form of nop instructions and in extra cycles. Considering the length of the pipeline, with its 15 stages, the number of vector registers seems low. With more vector registers, better instruction pipelining could be achieved. Only a few extra vector registers, maybe as few as one or two, could give a substantial gain in performance for applications where currently all 8 vector registers are used and a single extra register could save many inserted nop instructions. The extra hardware cost could be paid for by reducing the size of the LVMs; at most, only around 1400 vectors were used in one LVM.
This might, however, be a bad idea considering that the processor will be used for other applications which may have more use for a large LVM. The DCT is one example where using the entire LVM would be efficient if the DMA transfers can be hidden.

8.6 Register Forwarding

A lot of nop instructions were used in the blocks due to data dependency problems. This problem could be solved with register forwarding. This would increase the hardware cost and the complexity of the core but can give a good boost in performance.

8.7 New Instructions

A lot of new instructions were added during development of the blocks. This was mainly because they were mentioned in the instruction set but not implemented in the simulator. Some application specific instructions were also added that did not exist in the instruction set from the beginning.

8.7.1 SAD Calculations

The newly added instructions HVBSUBWA, HVBSUBNA, HVBSUMABSDWA and HVBSUMABSDNA will cost some extra hardware. The extra cost comes from the support for byte operand selection in the low half vector.

8.7.2 Call and Return

Call and return instructions were also proposed. This proposal does not offer an unlimited stack size. The solution of not keeping a stack pointer in memory offers some benefits, mainly that access to the LVMs is not needed. The call and return instructions could therefore be executed already in stage 2 of the pipeline. Call and return will thus only have 1 delay slot, and a maximum of 1 cycle will be wasted when call or return is used.

8.8 Master and Sleipnir Core

A topic that came up during implementation was whether the master core is too slow to provide all the Sleipnirs with data. When using kernels 1 and 2 this was a problem, because the blocks finished execution too fast compared to how long it took the master to provide data to all 8 Sleipnirs. This problem became smaller when the workload of the Sleipnir cores was increased.
The question still stands, since the workload of the master will be much higher when a complete encoder is implemented. By utilizing the 3 LVMs better through DMA pre-loading of data into the idling LVM, a lot of DMA cycles can be hidden and a better utilization of the Sleipnir cores can be achieved. This also enables the master to spread its workload over more cycles, but it will still have to complete a lot of work. Some of the problems were solved by making the Sleipnir blocks smarter. This will of course impact the complexity of the hardware in the Sleipnir core. If the blocks were more straightforward calculation blocks, the master would have to take over the decision making of 8 Sleipnir cores. In the case of the motion estimation, where a search takes (7 + (3 ∗ n) + 4) decisions, where n is the number of points searched, each search would on average result in 8 times that many decisions for the master to make. The results would also have to be copied to and from the cores each time. With the cycle cost of the DMA start and configuration in mind, this solution grows quickly in theoretical cycle cost. The message box system could of course be utilized, making the DMA cost disappear, but the decisions would still have to be made and the evaluation of the messages would need some calculation time. Another thing to consider is the copying of data back and forth between the LVMs and main memory. This could be avoided if the Sleipnir program memory were reprogrammed instead, leaving the large amount of data intact in the LVMs. Reprogramming the Sleipnir PM and CM to another block's functionality is relatively cheap compared to the DMA transfer of one full LVM. This cannot be utilized on a complete frame because the LVM is too small, but if the frames are divided into slices of suitable size it could be exploited, and a frame could be encoded in a number of stages where each slice is a stage.
This approach could save a lot of DMA transfers.

8.9 ePUMA H.264 Encoding Performance

The thesis aimed to evaluate how much capability the ePUMA processor offers for H.264 encoding. The results from motion estimation including motion compensation show that it can be done in less than 3 Mega cycles on 8 Sleipnir cores. Motion estimation is estimated to consume from 60% to 80% of the total encoding time [15]. There are some differences between the simplified motion estimation performed in this thesis and the motion estimation in that paper; block 5 does not include motion estimation for blocks smaller than 16x16 pixels. This estimate of the encoding time share might therefore be misleading in this case. Estimating the DCT's total cycle consumption when using 8 cores, it can be seen that it is much smaller than the cycle consumption of motion estimation. This tells us that the motion estimation still consumes a major part of the encoding time and will likely have the majority of the Sleipnir cores dedicated to it if the tasks of a complete encoder are divided amongst the Sleipnirs.

8.10 ePUMA Advantages

When programming the ePUMA architecture some features were used more than others, and they are presented in this section. The possibility of conditional execution on every instruction in the Sleipnir core was used in a number of places. The sections of code where it was used saved some instruction memory as well as execution time. Permutation of vectors and memory was another feature that was used. It was used when transposing matrices in the DCT and quantization block and made the execution of the block significantly faster. The pipeline offers two stages of ALUs, which made it possible to calculate the sum of 8 words in one instruction. This was used in the calculation of the SAD in the motion estimation block, as well as in the proposed complex instructions.
8.11 Observations

The DCT task cycle length is proportional to the amount of data that is copied to the LVMs. As can be seen in figure 8.1, the DCT does not need the same advanced task scheduling setup as the motion estimation needed, which can be seen in figure 6.9.

[Figure 8.1 is a timeline showing five Sleipnir cores executing DCT tasks fetched from main memory and writing results back to main memory.]

Figure 8.1: Sleipnir core DCT task partitioning and synchronization

[Figure 8.2 shows two LVM layouts over the memory banks B1 to B8; in the right layout the macroblock data is displaced one word per vector.]

Figure 8.2: Memory allocation of macroblock in LVM for intra coding

Due to the time constraint, some parts of the encoder were left out. One of these parts was intra coding, i.e. coding of an I-frame and I-slice. This is a very interesting part of encoding and will probably need some computation power. A similar problem has already been solved on the STI Cell processor in [13]. The partitioning of the frame done there could be used and adapted to the ePUMA memory size; the slices would probably have to be slightly smaller. The problem that has to be solved is the memory allocation when calculating the intra prediction. A memory allocation mapping that could be used can be seen in figure 8.2. As can be seen, the memory is displaced one word for every vector. This gives the opportunity to copy data to a new memory and permute the data in two instructions when intra coding. The drawback is that more memory has to be allocated. The figure describes a half macroblock residing in a LVM. On the right side of the separator the memory is displaced one word to be able to copy a column of data, namely I, J, K, L. B1, B2, B3 and so on are the 8 memory banks in the LVM.
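The effect of the one-word-per-row displacement in figure 8.2 can be illustrated with a toy bank-addressing model. The addressing function below is hypothetical and only demonstrates the principle: with the skew, a column of the macroblock falls into distinct banks and can therefore be fetched as one permuted vector access instead of one access per sample.

```python
BANKS = 8  # number of memory banks in a LVM, as in figure 8.2

def bank(row, col, skew):
    """Bank holding word (row, col); skew=1 displaces each row one word.
    Illustrative model, not the actual LVM address mapping."""
    return (col + skew * row) % BANKS

# Without displacement, a column of 8 samples lives in a single bank:
col0 = {bank(r, 3, skew=0) for r in range(8)}
# With the displacement of figure 8.2, the column spans all 8 banks:
col1 = {bank(r, 3, skew=1) for r in range(8)}
print(len(col0), len(col1))  # → 1 8
```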
Another part that was left out is the implementation of the deblocking filter. The function of the deblocking filter is described in section 3.6. This task is complex due to its many data dependencies. Work has been done on the STI Cell processor in [11] which may be worth looking into. The technique of using a wave front when applying the filter is a good idea and would work on the ePUMA as well. To solve the memory allocation problem when applying the deblocking filter, a displacement of vectors similar to figure 8.2 can be used. This makes it possible to read samples both row wise and column wise in one instruction each, after which the filter can be applied.

Chapter 9

Conclusions and Future Work

This chapter gives the conclusions of the work done in this thesis and proposes future work that could be done in the area.

9.1 Conclusions

In this thesis, selected parts of a video encoder using the H.264 standard were implemented and benchmarked using the ePUMA system simulator. The parts focused upon were motion estimation, motion compensation, DCT, IDCT, quantization and rescaling. The answers to the questions at issue from section 5.1.1 are presented here.

Is it possible to perform real-time full HD video encoding at 30 FPS using the H.264 standard on the ePUMA processor?

This is not a simple question and it cannot be fully answered from the work done in this thesis. From the results obtained it can be seen that it is possible to perform a simplified version of motion estimation and compensation on one of the full HD frames from the riverbed video sequence in slightly less than 5 Mega cycles on 4 Sleipnir cores. Whether this is good enough for real-time encoding at 30 frames per second depends on the clock frequency of the processor. The cycle cost per frame for the DCT with quantization and IDCT with rescaling Sleipnir blocks was approximated to less than 5 Mega cycles each, if running on one Sleipnir core each.
This means motion estimation and compensation will still constitute the larger part of the calculations for complex video sequences. Motion estimation and compensation are not performed when a frame is intra coded, which means a future intra coding block could use the 4 Sleipnirs otherwise used by the motion estimation and compensation block. A setup like this would leave 2 Sleipnir cores to handle the deblocking filter.

Would it be possible to modify the processor architecture to reach better performance and if so, would it be worth the cost of the potentially added hardware?

During the thesis work, a few ideas for hardware improvements came up. The first is functionality for call and return control instructions using a small hardware stack. This enables quick calls and returns without having to implement a software stack in memory. For most applications a few stack levels are enough, and the speedup gained from such a small hardware cost makes it well worth it. The second piece of added functionality would be byte selection from vector operands. This would be used in the proposed instructions depicted in appendix A.

What are the cycle costs compared to the STI Cell H.264 encoder?

The DCT with quantization and IDCT with rescaling Sleipnir blocks implemented on the ePUMA both have a cycle cost of about 50% of the corresponding implementations on the STI Cell processor. The simplified motion estimation and compensation kernel implemented has a better best case and a worse worst case cycle cost compared to the STI Cell implementation.

9.2 Future Work

There are many more interesting parts to work on and investigate in this area. One of them is to investigate whether additional motion vectors should be calculated in the motion estimation and compensation Sleipnir blocks for a greater performance gain, and whether it would be beneficial to use a more square frame partitioning.
Functionality for performing motion estimation and motion compensation on complete frames, including the edges, would also be interesting to implement, to see how well the message boxes of the Sleipnir cores can handle the special cases that arise. An intra prediction Sleipnir block could be developed to investigate the computational complexity of that part of an encoder and how well the task can be parallelized on the ePUMA architecture. For the deblocking filter, it would be interesting to see if it can be executed on two Sleipnir cores while consuming fewer cycles than motion estimation executed on four Sleipnir cores. Another task needed for the implementation of a final ePUMA H.264 encoder is to investigate what overhead and additional information have to be appended to each macroblock or frame. The final step would be to make a complete encoder by writing master code that can use and coordinate all the Sleipnir blocks.

Bibliography

[1] 1080p test sequences. ftp://ftp.ldv.e-technik.tu-muenchen.de/pub/test_sequences/1080p/, May 2010.

[2] ePUMA research team in Div of Computer Engineering. ePUMA Platform Hardware Architecture. February 2010.

[3] ePUMA research team in Div of Computer Engineering. ePUMA simulator manual. March 2010.

[4] ePUMA research team in Div of Computer Engineering. Sleipnir Instruction Set Manual. April 2010.

[5] P. List, A. Joch, J. Lainema, G. Bjøntegaard, and M. Karczewicz. Adaptive deblocking filter. IEEE Transactions on Circuits and Systems for Video Technology, Vol. 13, No. 7, July 2003.

[6] H. Malvar, A. Hallapuro, M. Karczewicz, and L. Kerofsky. Low-complexity transform and quantization in H.264/AVC. IEEE Transactions on Circuits and Systems for Video Technology, Vol. 13, No. 7, July 2003.

[7] D. Marpe, H. Schwarz, and T. Wiegand. Context-based adaptive binary arithmetic coding in the H.264/AVC video compression standard. IEEE Transactions on Circuits and Systems for Video Technology, Vol. 13, No. 7, July 2003.
[8] Div of Computer Engineering at LiU. Senior Assembler and Simulator User Manual. http://www.da.isy.liu.se/courses/tsea26/labs/2009/senior_assembler_and_simulator.pdf, September 2008.

[9] Div of Computer Engineering at LiU. Senior Instruction Set Manual. http://www.da.isy.liu.se/courses/tsea26/labs/2009/senior_instruction_set_manual.pdf, September 2008.

[10] Iain E. G. Richardson. H.264 and MPEG-4 Video Compression. Wiley, 2003. ISBN 0-470-84837-5.

[11] Lim Boon Shyang. A simplified high definition video encoder based on the STI Cell multiprocessor. Master's thesis, Linköpings Tekniska Högskola, January 2007.

[12] International Telecommunication Union. H.264: Advanced video coding for generic audiovisual services. Technical report, ITU-T, 2009.

[13] Zhengzhe Wei. H.264 baseline real-time high definition encoder on Cell. Master's thesis.

[14] T. Wiegand, G. J. Sullivan, G. Bjøntegaard, and A. Luthra. Overview of the H.264/AVC video coding standard. IEEE Transactions on Circuits and Systems for Video Technology, Vol. 13, No. 7, July 2003.

[15] D. Wu, B. Lim, J. Eilert, and D. Liu. Parallelization of high-performance video encoding on a single-chip multiprocessor. In 2007 IEEE International Conference on Signal Processing and Communications, pages 145–148, November 2007.

[16] C. Zhu, X. Lin, and L. Chau. Hexagon-based search pattern for fast block motion estimation. IEEE Transactions on Circuits and Systems for Video Technology, Vol. 12, No. 5, May 2002.
Appendix A

Proposed Instructions

[Figure A.1: HVBSUMABSDWA — dataflow diagram: byte lanes selected from the two 128-bit operands A and B are subtracted, passed through ABS units, and summed by an adder tree.]

[Figure A.2: HVBSUMABSDNA — dataflow diagram, as Figure A.1 but with a different (non-aligned) byte selection from the operands.]

[Figure A.3: HVBSUBWA — dataflow diagram: byte-wise subtraction of selected lanes from the two 128-bit operands.]

[Figure A.4: HVBSUBNA — dataflow diagram, as Figure A.3 but with non-aligned byte selection.]

Appendix B

Results

Blue Sky
            Sleipnirs: 1          2          4          8
Kernel 1      30191409   15349954    8095485    5954035
Kernel 2      19411816    9984269    5974027    5789997
Kernel 3      15895187    7985390    4047430    2121506
Kernel 4      15250896    7651458    3849476    1962095
Kernel 5      16564840    8312959    4187640    2141442

Sunflower
            Sleipnirs: 1          2          4          8
Kernel 1      27821589   14104241    7397368    5898648
Kernel 2      18200644    9277820    5742826    5797178
Kernel 3      14692631    7380247    3726547    1933467
Kernel 4      14048580    7043845    3544288    1802451
Kernel 5      15424708    7734366    3889893    1982586

Pedestrian Area
            Sleipnirs: 1          2          4          8
Kernel 1      32400765   16398183    8819394    6154442
Kernel 2      20493772   10491439    6454460    5825473
Kernel 3      16979339    8523687    4314053    2335474
Kernel 4      16334940    8182102    4113342    2101209
Kernel 5      17636560    8832762    4447240    2283589

Riverbed
            Sleipnirs: 1          2          4          8
Kernel 1      36427689   18463337    9630235    6169334
Kernel 2      22506736   11586716    6571084    5852373
Kernel 3      18990971    9537789    4824777    2500630
Kernel 4      18346788    9197313    4632170    2361100
Kernel 5      19673956    9866870    4975286    2540852

Table B.1: Simulation cycle cost of motion estimation kernels
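The HVBSUMABSD instructions of Appendix A compute a sum of absolute byte differences, the core operation of SAD-based motion estimation. The exact byte-lane selection in the figures is not fully recoverable here, so the sketch below models only the generic ABS-and-adder-tree core under that assumption; the function name is hypothetical and not an ePUMA identifier.

```python
def sad(block_a, block_b):
    """Sum of absolute differences over byte lanes.

    Models the subtract / ABS / adder-tree core of the proposed
    HVBSUMABSDWA and HVBSUMABSDNA instructions. The byte selection the
    real instructions perform on 128-bit operands is omitted, since the
    exact lane grouping is an assumption here.
    """
    if len(block_a) != len(block_b):
        raise ValueError("operands must have the same number of byte lanes")
    return sum(abs(a - b) for a, b in zip(block_a, block_b))
```

In motion estimation this value is evaluated for each candidate block in the reference frame, and the candidate with the smallest SAD yields the motion vector; a single wide instruction for it removes the per-byte loop from the inner search.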
Copyright

The publishers will keep this document online on the Internet — or its possible replacement — for a period of 25 years from the date of publication barring exceptional circumstances. The online availability of the document implies a permanent permission for anyone to read, to download, to print out single copies for his/her own use and to use it unchanged for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility. According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.
For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/

© Jonas Einemo, Magnus Lundqvist
