Institutionen för systemteknik
Department of Electrical Engineering

Master's Thesis

A Selection of H.264 Encoder Components
Implemented and Benchmarked on a Multi-core
DSP Processor

Master's thesis in Computer Engineering
at Linköping Institute of Technology
by

Jonas Einemo
Magnus Lundqvist

LiTH-ISY-EX--10/4392--SE
Linköping 2010

Department of Electrical Engineering
Linköpings universitet
SE-581 83 Linköping, Sweden
Supervisor: Olof Kraigher, ISY, Linköpings universitet
Examiner: Professor Dake Liu, ISY, Linköpings universitet

Linköping, 15 June, 2010
Division, Department: Division of Computer Engineering, Department of Electrical Engineering, Linköpings universitet, SE-581 83 Linköping, Sweden
Date: 2010-06-15
Language: English
Report category: Master's thesis (Examensarbete)
ISRN: LiTH-ISY-EX--10/4392--SE
URL for electronic version:
http://www.da.isy.liu.se/en/index.html
http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-4292
Title: A Selection of H.264 Encoder Components Implemented and Benchmarked on a Multi-core DSP Processor
Authors: Jonas Einemo, Magnus Lundqvist

Abstract
H.264 is a video coding standard which offers a high data compression rate at the cost of a high computational load. This thesis evaluates how well parts of the H.264 standard can be implemented for a new multi-core digital signal processing processor architecture called ePUMA, and investigates whether real-time encoding of high-definition video sequences could be performed. The implementation consists of the motion estimation, motion compensation, discrete cosine transform, inverse discrete cosine transform, quantization and rescaling parts of the H.264 standard. Benchmarking is done using the ePUMA system simulator, and the results are compared to an implementation of an existing H.264 encoder for another multi-core processor architecture called STI Cell. The results show that the selected parts of the H.264 encoder could be run on 6 calculation cores in 5 million cycles per frame. This setup leaves 2 calculation cores to run the remaining parts of the encoder.
Keywords: ePUMA, DSP, SIMD, H.264, Parallel Programming, Motion Estimation, DCT
Acknowledgments

We would like to thank everyone who has helped us during our thesis work, especially our supervisor Olof Kraigher for all his help and useful hints, and our examiner Professor Dake Liu for his support, his comments and the opportunity to do this thesis. We would also like to thank Jian Wang for the support on the DMA firmware, Jens Ogniewski for the help with understanding the H.264 standard, and our families and friends for their support and for bearing with us during the work on this thesis.

Jonas Einemo, Magnus Lundqvist
Linköping, June 2010
Contents

1 Introduction
  1.1 Background
  1.2 Purpose
  1.3 Scope
  1.4 Way of Work
  1.5 Outline

2 Overview of Video Coding
  2.1 Introduction to Video Coding
  2.2 Color Spaces
  2.3 Predictive Coding
  2.4 Transform Coding and Quantization
  2.5 Entropy Coding
  2.6 Quality Measurements
    2.6.1 Subjective Quality
    2.6.2 Objective Quality

3 Overview of H.264
  3.1 Introduction to H.264
  3.2 Coded Slices
    3.2.1 I Slice
    3.2.2 P Slice
    3.2.3 B Slice
    3.2.4 SP Slice
    3.2.5 SI Slice
  3.3 Intra Prediction
  3.4 Inter Prediction
    3.4.1 Hexagon search
  3.5 Transform Coding and Quantization
    3.5.1 Discrete Cosine Transform
    3.5.2 Inverse Discrete Cosine Transform
    3.5.3 Quantization
    3.5.4 Rescaling
  3.6 Deblocking filter
  3.7 Entropy coding

4 Overview of the ePUMA Architecture
  4.1 Introduction to ePUMA
  4.2 ePUMA Memory Hierarchy
  4.3 Master Core
    4.3.1 Master Memory Architecture
    4.3.2 Master Instruction Set
    4.3.3 Datapath
  4.4 Sleipnir Core
    4.4.1 Sleipnir Memory Architecture
    4.4.2 Datapath
    4.4.3 Sleipnir Instruction Set
    4.4.4 Complex Instructions
  4.5 DMA Controller
  4.6 Simulator

5 Elaboration of Objectives
  5.1 Task Specification
    5.1.1 Questions at Issue
  5.2 Method
  5.3 Procedure

6 Implementation
  6.1 Motion Estimation
    6.1.1 Motion Estimation Reference
    6.1.2 Complex Instructions
    6.1.3 Sleipnir Blocks
    6.1.4 Master Code
  6.2 Discrete Cosine Transform and Quantization
    6.2.1 Forward DCT and Quantization
    6.2.2 Rescaling and Inverse DCT

7 Results and Analysis
  7.1 Motion Estimation
    7.1.1 Kernel 1
    7.1.2 Kernel 2
    7.1.3 Kernel 3
    7.1.4 Kernel 4
    7.1.5 Kernel 5
    7.1.6 Master Code
    7.1.7 Summary
  7.2 Transform and Quantization

8 Discussion
  8.1 DMA
  8.2 Main Memory
  8.3 Program Memory
  8.4 Constant Memory
  8.5 Vector Register File
  8.6 Register Forwarding
  8.7 New Instructions
    8.7.1 SAD Calculations
    8.7.2 Call and Return
  8.8 Master and Sleipnir Core
  8.9 ePUMA H.264 Encoding Performance
  8.10 ePUMA Advantages
  8.11 Observations

9 Conclusions and Future Work
  9.1 Conclusions
  9.2 Future Work

Bibliography

A Proposed Instructions

B Results
List of Figures

2.1  Overview of the data flow in a basic encoder and a decoder
2.2  YUV 4:2:0 sampling format
3.1  Overview of the data flow in an H.264 encoder
3.2  4x4 luma prediction modes
3.3  16x16 luma prediction modes
3.4  Different ways to split a macroblock in inter prediction
3.5  Subsamples interpolated from neighboring pixels
3.6  Multiple frame prediction
3.7  Large (a) and small (b) search pattern in the hexagon search algorithm
3.8  Movement of the hexagon pattern in a search area and the change to the smaller search pattern
3.9  DCT functional schematic
3.10 IDCT functional schematic
3.11 Filtering order of a 16x16 pixel macroblock with start in A and end in H for luminance (a) and start in 1 and end in 4 for chrominance (b)
3.12 Pixels in blocks adjacent to vertical and horizontal boundaries
4.1  ePUMA memory hierarchy
4.2  ePUMA star network interconnection
4.3  Senior datapath for short instructions
4.4  Sleipnir datapath pipeline schematic
4.5  Sleipnir Local Store switch
6.1  Motion estimation program flowchart
6.2  Motion estimation computational flowchart
6.3  Hexagon search program flow controller
6.4  Proposed implementation of call and return hardware
6.5  Reference macroblock overlap
6.6  Reference macroblock partitioning for 13 data macroblocks
6.7  Master program flowchart
6.8  Memory allocation of data memory in the master (a) and main memory allocation (b)
6.9  Sleipnir core motion estimation task partitioning and synchronization
6.10 DCT flowchart
6.11 Memory transpose schematic
7.1  Cycle scaling from 1 to 8 Sleipnir cores for simulation of riverbed
7.2  Frame 10 from Pedestrian Area video sequence
7.3  Difference between frame 10 and frame 11 in Pedestrian Area video sequence
7.4  Motion vector field calculated by kernel 5 on frame 10 and 11 of the Pedestrian Area video sequence
7.5  Difference between frame 10 and frame 11 in Pedestrian Area video sequence using motion compensation
8.1  Sleipnir core DCT task partitioning and synchronization
8.2  Memory allocation of macroblock in LVM for intra coding
A.1  HVBSUMABSDWA
A.2  HVBSUMABSDNA
A.3  HVBSUBWA
A.4  HVBSUBNA
List of Tables

3.1  Qstep for a few different values of QP
3.2  Multiplication factor MF
3.3  Scaling factor V
4.1  Pipeline specification
4.2  Register file access types
4.3  Address register increment operations
4.4  Addressing modes examples
7.1  Short names for kernels that have been tested
7.2  Description of table columns
7.3  Motion estimation results from simulation on Riverbed frame 10 and Riverbed frame 11 with kernel 1 using 1 Sleipnir core
7.4  Motion estimation results from simulation on Riverbed frame 10 and Riverbed frame 11 with kernel 1 using 8 Sleipnir cores
7.5  Block 1 costs
7.6  Motion estimation results from simulation on Riverbed frame 10 and Riverbed frame 11 with kernel 2 using 1 Sleipnir core
7.7  Motion estimation results from simulation on Riverbed frame 10 and Riverbed frame 11 with kernel 2 using 8 Sleipnir cores
7.8  Block 2 costs
7.9  Motion estimation results from simulation on Riverbed frame 10 and Riverbed frame 11 with kernel 3 using 8 Sleipnir cores
7.10 Kernel 3 costs
7.11 Motion estimation results from simulation with Riverbed frame 10 and Riverbed frame 11 with kernel 4 using 4 Sleipnir cores
7.12 Motion estimation results from simulation on Riverbed frame 10 and Riverbed frame 11 with kernel 4 using 8 Sleipnir cores
7.13 Kernel 4 costs
7.14 Motion estimation results from simulation on Sunflower frame 10 and Sunflower frame 11 with kernel 5 using 8 Sleipnir cores
7.15 Motion estimation results from simulation on Blue sky frame 10 and Blue sky frame 11 with kernel 5 using 8 Sleipnir cores
7.16 Motion estimation results from simulation on Pedestrian area frame 10 and Pedestrian area frame 11 with kernel 5 using 8 Sleipnir cores
7.17 Motion estimation results from simulation on Riverbed frame 10 and Riverbed frame 11 with kernel 5 using 4 Sleipnir cores
7.18 Motion estimation results from simulation on Riverbed frame 10 and Riverbed frame 11 with kernel 5 on 8 Sleipnir cores
7.19 Kernel 5 costs
7.20 Master code cost
7.21 Prolog and epilog cycle costs
7.22 Simulated epilog cycle cost including waiting for last Sleipnir to finish
7.23 DMA cycle costs
7.24 Costs for DCT with quantization block and IDCT with rescaling block
B.1  Simulation cycle cost of motion estimation kernels
Abbreviations

AGU            Address Generation Unit
ALU            Arithmetic Logic Unit
AVC            Advanced Video Coding
CABAC          Context-based Adaptive Binary Arithmetic Coding
CAVLC          Context-based Adaptive Variable Length Coding
CB             Copy Back
CM             Constant Memory
CODEC          COder/DECoder
DCT            Discrete Cosine Transform
DMA            Direct Memory Access
DSP            Digital Signal Processing
ePUMA          Embedded Parallel Digital Signal Processing Processor Architecture with Unique Memory Access
FIR            Finite Impulse Response
FPS            Frames Per Second
FS             Full Search
HDTV           High-Definition Television
HVBSUBNA       Half Vector Bytewise SUBtraction Not word Aligned
HVBSUBWA       Half Vector Bytewise SUBtraction Word Aligned
HVBSUMABSDNA   Half Vector Bytewise SUM of ABSolute Differences Not word Aligned
HVBSUMABSDWA   Half Vector Bytewise SUM of ABSolute Differences Word Aligned
IDCT           Inverse Discrete Cosine Transform
IEC            International Electrotechnical Commission
ISO            International Organization for Standardization
ITU            International Telecommunications Union
LS             Local Storage
LVM            Local Vector Memory
MAE            Mean Absolute Error
MB             Macroblock
MC             Motion Compensation
ME             Motion Estimation
MF             Multiplication Factor
MPEG           Moving Picture Experts Group
MSE            Mean Square Error
NAL            Network Abstraction Layer
NoC            Network on Chip
PM             Program Memory
PSNR           Peak Signal to Noise Ratio
QP             Quantization Parameter
RAM            Random Access Memory
RGB            Red, Green and Blue, a color space
ROM            Read Only Memory
SAD            Sum of Absolute Differences
SPRF           SPecial Register File
STI            Sony Toshiba IBM
V              Rescaling Factor
VCEG           Video Coding Experts Group
VRF            Vector Register File
YUV            A color space
Chapter 1
Introduction
This chapter gives a background to the thesis, defines the purpose, scope and way of work, and presents the outline of the thesis.
1.1 Background
With new handheld devices and mobile systems offering ever more advanced services, the need for increased computational power at low cost, both in terms of chip area and power dissipation, keeps growing. Now that video playback and recording are standard applications rather than premium features in mobile devices, providing high computational power at a low cost remains a problem without a satisfactory solution.
The Division of Computer Engineering at the Department of Electrical Engineering at Linköpings Tekniska Högskola has for some time been part of a research project called ePUMA, which stands for Embedded Parallel Digital Signal Processing Processor Architecture with Unique Memory Access. The development is driven by the demands of next-generation digital signal processing. By developing a cheap, low-power processor with high computational capacity, this new architecture aims to meet tomorrow's demands in digital signal processing. The main applications for the processor are future radio base stations, radar and High-Definition Television (HDTV).
H.264 is a standard for video compression that was first released in 2003. It is now a mature and widely adopted standard, used in Blu-ray, popular video streaming websites such as YouTube, television services and video conferencing. It provides very good compression at the cost of high computational complexity. The hope is that the ePUMA multi-core architecture will be able to handle real-time video encoding using the H.264 standard.
At the Division of Computer Engineering, previous work has been done on implementing an H.264 encoder for another multi-core architecture, the STI Cell, which is used in, for example, the popular video gaming console PLAYSTATION 3.
1.2 Purpose
The purpose of this master's thesis is to evaluate the capability of the ePUMA processor architecture with respect to real-time video encoding using the H.264 video compression standard, and to identify possible areas of improvement in the ePUMA architecture. This is done by implementing parts of an H.264 encoder and, where possible, comparing the cycle counts with those of the previously implemented STI Cell H.264 encoder.
1.3 Scope
By implementing the most computationally expensive parts of the H.264 standard, it is possible to better estimate whether the ePUMA processor architecture is capable of encoding video using the H.264 standard in real time. A study of the H.264 standard shows that entropy coding is the most time-consuming part if it is done in software. Because of the large amount of bit manipulation required, it is not feasible to perform entropy coding on the processor. Therefore, an early decision was made that entropy coding had to be hardware accelerated and should not be a part of this thesis.

No exact hardware costs for performance improvements are calculated in this thesis; instead, their feasibility is discussed.

The time frame of this master's thesis is twenty weeks, which restricts the extent of the work. Because of this time constraint, some parts of a complete encoder have had to be left out.
1.4 Way of Work
One of the most time-consuming tasks is motion estimation, which together with the discrete cosine transform and quantization became the primary target for evaluation. First, a working implementation was produced. Iterative development was then used to refine the implementations and reach better performance. The partial implementations of the H.264 standard were written for the ePUMA system simulator. The simulator was also used for all performance measurements of the implementations, using frames from several commonly used test video sequences. Once the performance measurement results were acquired, they were analyzed and conclusions were drawn. The way of work is elaborated in section 5.2 and section 5.3.
1.5 Outline
This thesis is aimed at an audience with an education in electrical engineering, computer engineering or a similar field. Expertise in video coding or the H.264 standard is not necessary, as the main principles of these topics will be covered.

The thesis is organized as follows. This introductory chapter is followed by theoretical chapters covering the topics needed to understand the rest of the thesis. The first of these is chapter 2, which covers the basics of video coding, followed by chapter 3, which offers an introduction to the H.264 video coding standard. The last theoretical chapter is chapter 4, which covers the hardware architecture and toolchain of the ePUMA processor. The theory is followed by chapter 5, where a more detailed task specification, method and procedure are presented, building on the knowledge from the theoretical chapters. After that, chapter 6 describes the function and development of the implementations produced. Chapter 7 then presents the results and gives an analysis of them. Chapter 8 contains a discussion of the results as well as ideas that arose while working on this thesis. The final chapter is chapter 9, which contains the conclusions and the future work that could be done in the area.
Chapter 2
Overview of Video Coding
This chapter gives an introduction to video coding, color spaces, predictive coding, transform coding and entropy coding. This knowledge is necessary to understand the rest of the thesis.
2.1 Introduction to Video Coding
A video consists of several images, called frames, shown in sequence. The amount of disk space required to store a sequence of raw frames is huge, and therefore video coding is needed. The purpose of video coding is to minimize the data to store on disk or to send over a network, without decreasing the image quality too much. Many techniques and standards exist for this, such as MPEG-2, MPEG-4 and H.264/AVC. [10]
Figure 2.1: Overview of the data flow in a basic encoder and a decoder
All of these algorithms are constructed from a similar template. First, some technique is used to reduce the amount of data to be transformed. The video is then transformed, for example with a Discrete Cosine Transform (DCT). After this, quantization is performed to shrink the data further. The data is then pushed through an entropy coder such as Huffman, or a more advanced algorithm such as Context-based Adaptive Binary Arithmetic Coding (CABAC) or Context-based Adaptive Variable Length Coding (CAVLC), all of which compress the data based on patterns in the bit-stream. [10] The data flow of a basic encoder and a basic decoder is illustrated in figure 2.1.
As mentioned, a video sequence consists of many frames. In video coding these frames can be divided into slices. A slice can be a part of a frame or contain the complete frame. This slice division is advantageous because it guarantees, for example, that data in a slice does not depend on data outside the slice. The frames are also divided into macroblocks. A macroblock is a block of 16×16 pixels. This partitioning of the data makes computations easier to organize and structure. [10]
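The macroblock partitioning described above can be sketched as follows. This is an illustration only; the frame dimensions are made-up examples, not values used in the thesis.

```python
# Sketch of how a frame is partitioned into 16x16 macroblocks.
MB_SIZE = 16

def macroblocks(width, height):
    """Yield (x, y) top-left coordinates of each 16x16 macroblock."""
    for y in range(0, height, MB_SIZE):
        for x in range(0, width, MB_SIZE):
            yield x, y

# A 1280x720 frame contains (1280/16) * (720/16) = 80 * 45 = 3600 macroblocks.
count = sum(1 for _ in macroblocks(1280, 720))
print(count)  # 3600
```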
2.2 Color Spaces
To understand video coding, some knowledge about different color spaces is needed. One common color space is RGB, whose name comes from its components red, green and blue. With different intensities of these three colors it is possible to represent all colors in the visible spectrum. Another commonly used color space is YCbCr, also called YUV. In this color space Y represents the luminance (luma) component, which corresponds to the brightness of a specific pixel. The other two components, Cb and Cr, are chrominance (chroma) components which carry the color information. [10] The conversion from the RGB color space to the YUV color space is shown in equation (2.1).

Y = k_r R + k_g G + k_b B
C_b = B − Y
C_r = R − Y
C_g = G − Y    (2.1)

As seen in equation (2.1) there also exists a third chrominance component for green, namely C_g, which thanks to equation (2.2) can be calculated as shown in equation (2.3). This means that C_g can be calculated by the decoder and does not have to be transmitted, which is advantageous for data compression. [10]
k_b + k_r + k_g = 1    (2.2)

C_g = Y − C_b − C_r    (2.3)
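Equation (2.1) can be sketched in code. The weights k_r, k_g, k_b below are an assumption (the ITU-R BT.601 values); the derivation above only requires that they sum to 1, per equation (2.2).

```python
# Sketch of the RGB-to-YUV conversion in equation (2.1).
# The weights are assumed BT.601 values, not taken from the thesis.
KR, KG, KB = 0.299, 0.587, 0.114  # sum to 1, as equation (2.2) requires

def rgb_to_yuv(r, g, b):
    y = KR * r + KG * g + KB * b  # luminance
    cb = b - y                    # blue chrominance
    cr = r - y                    # red chrominance
    return y, cb, cr

# For a pure white pixel the chrominance components vanish:
y, cb, cr = rgb_to_yuv(255, 255, 255)
print(round(y), round(cb), round(cr))  # 255 0 0
```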
The human eye is more sensitive to luminance than to chrominance, and because of that a smaller number of bits can be used to represent the chrominance and a larger number for the luminance. With this feature of the YUV color space, the total number of bits needed to encode a pixel can be reduced. A common way to do this is to apply the 4:2:0 sampling format.
Figure 2.2: YUV 4:2:0 sampling format
The 4:2:0 sampling format can be described as a '12 bits per pixel' format, where there are 2 samples of chrominance for every 4 samples of luminance, as shown in figure 2.2. If each sample is stored using 8 bits, this adds up to 6 ∗ 8 = 48 bits for 4 YUV 4:2:0 pixels, an average of 48/4 = 12 bits per pixel. [10]
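The arithmetic above can be written out explicitly:

```python
# Bits per pixel for YUV 4:2:0, where each group of 4 luma samples
# shares one Cb and one Cr sample, as described in the text.
BITS_PER_SAMPLE = 8
luma_samples = 4
chroma_samples = 2  # one Cb + one Cr per 4 luma samples

total_bits = (luma_samples + chroma_samples) * BITS_PER_SAMPLE  # 6 * 8 = 48
bits_per_pixel = total_bits / luma_samples                      # 48 / 4 = 12
print(total_bits, bits_per_pixel)  # 48 12.0
```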
2.3 Predictive Coding
There are two kinds of predictive coding: intra coding and inter coding. By studying a picture it is easy to see that some parts of the picture are very similar; this is called spatial correlation. Predictive coding that uses these spatial correlations within a frame to form a prediction of other parts of the frame is called intra coding. By studying a sequence of pictures or a video sequence it can be seen that there is usually not much difference between the frames; this is called temporal correlation. By exploiting this temporal correlation, a difference, also called a residue, can be calculated which is comprised of smaller values and can therefore be described with a smaller number of bits. This results in better data compression. Predictive coding that uses temporal correlations between different frames is called inter coding. [10]
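The residue idea can be sketched in a few lines. The pixel rows below are made-up examples chosen to show the effect of high temporal correlation.

```python
# Inter-frame residual: subtracting the previous frame from the current
# one yields values that cluster near zero when frames are similar.
previous = [100, 102, 101, 99]
current  = [101, 102, 103, 99]

residue = [c - p for c, p in zip(current, previous)]
print(residue)  # [1, 0, 2, 0] -- small values, cheaper to encode
```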
2.4 Transform Coding and Quantization
The purpose of transform coding is to convert the image data or motion compensated data into another representation. This can be done with a number of different algorithms, of which the block based Discrete Cosine Transform (DCT) is one of the most common in video coding. The DCT algorithm converts the data into sums of cosine functions oscillating at different frequencies. [10]
There are several transforms that could be used in video coding, but the common property of them all is that they are reversible, meaning the transform can be inverted without loss of data. This is an important property because otherwise drift between the encoder and decoder can occur, and special algorithms would have to be applied to correct these errors. As mentioned before, block based transform coding is the most common. When using block based transform coding, the picture is divided into smaller blocks such as 8 × 8 or 4 × 4 pixels. Each block is then transformed with the chosen transform. The transformed data is then quantized to remove high frequency data. This can be done because the human eye is insensitive to higher frequencies, so these can be removed without any noticeable loss of quality. The quantizer re-maps the input data, with one range of values, to output data with a smaller range of possible values. This means the output can be coded with fewer bits than the original data, and in this way data compression is achieved. [10]
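The quantizer's range re-mapping can be sketched with a simple uniform quantizer. The coefficient values and the step size here are made-up examples, not the H.264 Qstep/MF scheme described in chapter 3.

```python
# Uniform quantization of transform coefficients: dividing by a step
# size shrinks the value range, and small high-frequency coefficients
# round to zero, which is where most of the compression comes from.
qstep = 10
coefficients = [312, -155, 48, 7, -3, 2, 0, -1]  # example coefficients

quantized = [round(c / qstep) for c in coefficients]
rescaled  = [q * qstep for q in quantized]        # decoder-side inverse

print(quantized)  # [31, -16, 5, 1, 0, 0, 0, 0]
```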
2.5 Entropy Coding
Entropy coding is a lossless data compression technique. Entropy coding algorithms encode symbols that occur often with a small number of bits, and symbols that occur less often with more bits. The bits are all put into a bitstream that can be written to disk or sent over a network. In video coding these symbols can be quantized transform coefficients, motion vectors, headers or other information that must be sent to be able to decode the video stream. As mentioned earlier, a few of the common entropy coding algorithms are Huffman, CABAC and CAVLC. [10]
2.6 Quality Measurements
There exist several ways to measure the quality of images and to compare uncompressed images with reconstructed ones in order to evaluate video coding algorithms.
2.6.1 Subjective Quality
Subjective quality is the quality that someone watching an image or a video sequence experiences. It can be measured by having evaluators rate each part of a series of images or video sequences with different properties. This can be a time consuming and impractical way of measurement in most circumstances. [10]
2.6.2 Objective Quality
To enable more automatic measurements of quality some algorithms are commonly
used. One of these is Peak Signal to Noise Ratio (PSNR) which can be used to
measure the quality of a reconstructed image by comparing it to an uncompressed
one. PSNR gives a logarithmic scale where a higher value is better. The Mean
Square Error (MSE) is used in the calculation of PSNR and is calculated as
MSE = \frac{1}{m \cdot n} \sum_{i=1}^{m} \sum_{j=1}^{n} (C(i,j) - R(i,j))^2   (2.4)
where n is the image height, m is the image width and C and R are the current
and reference images being compared. With the MSE value the PSNR can be
calculated as
PSNR = 10 \cdot \log_{10} \left( \frac{(2^{bits} - 1)^2}{MSE} \right)   (2.5)

where 2^{bits} - 1 is the largest representable value of a pixel with the specified number of bits. [10]
Chapter 3
Overview of H.264
This chapter presents an overview of the H.264 video compression standard. Some sections are more detailed than others because of their relevance to this master's thesis.
The topics covered include the different frame and slice types, intra and inter
prediction, transform coding, quantization, deblocking filter and finally entropy
coding.
3.1 Introduction to H.264
H.264 [12], also known as Advanced Video Coding (AVC) and MPEG-4 Part 10, is a standard for video compression. The standard has been developed by the Video Coding Experts Group (VCEG) of the International Telecommunication Union (ITU) and the Moving Picture Experts Group (MPEG), which is a working group of the International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC). The main objective when H.264 was developed was to maximize the efficiency of the video compression, but also to provide a standard with high transmission efficiency which supports reliable and robust transmission of data over different channels and networks. [10]
H.264 is divided into a number of different profiles. These profiles include different parts of the video coding features from the H.264 standard. Some of the most common ones are the Extended, Baseline, Constrained Baseline and Main profiles. The Baseline profile supports inter and intra coding and entropy coding with CAVLC. The Main profile supports interlaced video, inter coding using B-slices and entropy coding using CABAC. The Extended profile does not support interlaced video nor CABAC, but supports switching slices and has improved error resilience. [10]
In figure 3.1 a detailed view of the data flow in an H.264 encoder can be seen.
This figure illustrates the important prediction coding and how it is connected to
the other parts of the encoder. The in-loop deblocking filter can also be seen in
this illustration. [10]
[Block diagram: the current frame Fn is predicted either by motion estimation (ME) and motion compensation (MC) against the reference frame F'n-1, or by intra prediction; the residual passes through the discrete cosine transform (DCT), quantization (Q) and reordering to the entropy encoder and the NAL, while the rescaling, inverse DCT (IDCT) and deblocking filter path produces the reconstructed frame F'n.]

Figure 3.1: Overview of the data flow in an H.264 encoder
3.2 Coded Slices
A frame can be divided into smaller parts called slices. These slices can then be coded in different modes. The different coding modes in H.264 are presented below [14].
3.2.1 I Slice
In the I slice all macroblocks are intra coded. The encoder uses the spatial correlations within a single slice to code that slice. Of all the different slice types, the I slice takes up the most space after it has been encoded. [10]
3.2.2 P Slice
P slices can contain both I coded macroblocks and P coded macroblocks. P coded
macroblocks are predicted from a list of reference macroblocks. [10]
3.2.3 B Slice
B slices or bidirectional slices can contain both B coded macroblocks and I coded
macroblocks. B coded macroblocks can be predicted from two different lists of
reference macroblocks both before and after the current frame in time. [10]
3.2.4 SP Slice
A Switching P (SP) slice is coded in a way that supports easy switching between
similar precoded video streams without suffering a high penalty for sending a new
I slice. [10]
3.2.5 SI Slice
A Switching I (SI) slice is an intra coded slice and supports easy switching between two different streams that do not correlate. [10]
3.3 Intra Prediction
In intra coding the encoder only uses data from the current frame. Intra prediction is the next step in this direction, trying to minimize the coded frame size. With intra prediction the encoder tries to utilize the spatial correlation within the frame. [10]
[Diagram of the nine 4x4 luma prediction modes: 0 (vertical), 1 (horizontal), 2 (DC), 3 (diagonal down-left), 4 (diagonal down-right), 5 (vertical-right), 6 (horizontal-down), 7 (vertical-left) and 8 (horizontal-up), each predicting the block from the neighboring pixels labeled A-M.]

Figure 3.2: 4x4 luma prediction modes
[Diagram of the four 16x16 luma prediction modes: 0 (vertical), 1 (horizontal), 2 (DC) and 3 (plane), predicting the block from the vertical (V) and horizontal (H) neighboring pixels.]

Figure 3.3: 16x16 luma prediction modes
H.264 supports 9 different intra prediction modes for 4x4 sample luma blocks,
four different modes for 16x16 sample luma blocks and four modes for 8x8 chroma
components. The 9 4x4 prediction modes are illustrated in figure 3.2 and the 4
16x16 luma prediction modes are illustrated in figure 3.3. The pixels are interpolated or extrapolated from the nearby pixels, i.e. the pixels marked with letters. Usually
the encoder selects the prediction mode that minimizes the difference between the
predicted block and the block to be encoded. I_PCM is another prediction mode
which makes it possible to transmit samples of an image without prediction or
transformation. [10, 14]
3.4 Inter Prediction
Inter prediction creates a prediction model from one or more previously encoded
video frames or slices using block-based motion compensation. The motion vector
precision can be up to a quarter of a pixel. The task is to find a vector that points to the block of reference pixels that has the smallest difference to the block in the frame that is being encoded. [10]
[Diagram of macroblock partitions: 16x16, 16x8, 8x16 and 8x8, where each 8x8 partition can be further split into 8x4, 4x8 or 4x4 sub-partitions.]

Figure 3.4: Different ways to split a macroblock in inter prediction.
H.264 supports a range of block sizes from 16x16 down to 4x4 pixels, as illustrated in figure 3.4. Using big blocks saves data because fewer motion vectors are needed, but the distortion can be very high when there are a lot of small things moving around in the video sequence. Using smaller blocks will in many cases lower the distortion, but will instead increase the amount of bits needed to store the increased number of motion vectors. By letting the encoder find the best trade-off, good data compression of the video sequence can be achieved. The blocks are split when a threshold value is reached. [10]
SAD = \sum_{i=1}^{m} \sum_{j=1}^{n} |C(i,j) - R(i,j)|   (3.1)

MSE = \frac{1}{m \cdot n} \sum_{i=1}^{m} \sum_{j=1}^{n} (C(i,j) - R(i,j))^2   (3.2)

MAE = \frac{1}{m \cdot n} \sum_{i=1}^{m} \sum_{j=1}^{n} |C(i,j) - R(i,j)|   (3.3)
The macroblock cost is commonly calculated in one of a few different ways. Sum of Absolute Difference (SAD) is the most common as it offers the lowest computational complexity. The definition of SAD can be found in equation (3.1). Two other common ways to calculate the cost are Mean Square Error (MSE) and Mean Absolute Error (MAE), presented in equation (3.2) and equation (3.3) respectively. In equation (3.1), equation (3.2) and equation (3.3), n is the image width and m is the image height. [10]
[Diagram of integer pixel positions (capital letters A-U) together with the half- and quarter-pixel positions (lower-case letters a-s) that are interpolated between them.]

Figure 3.5: Subsamples interpolated from neighboring pixels
More accurate motion estimation in the form of sub pixel motion vectors is available in H.264. Up to quarter pixel resolution is supported for the luma component and one eighth sample resolution for the chroma components. This motion estimation is made possible by interpolating neighboring pixels and then comparing with the current frame in the encoder. The interpolation is performed by a 6 tap Finite Impulse Response (FIR) filter with weights (1/32, -5/32, 20/32, 20/32, -5/32, 1/32). [10]
In figure 3.5 the half pixel sample b can be located. To generate this sample
equation (3.4) can be used. Sample m can be calculated in a similar way shown
in equation (3.5). [10]
b = round((E - 5F + 20G + 20H - 5I + J)/32)   (3.4)

m = round((B - 5D + 20H + 20N - 5S + U)/32)   (3.5)
After generating all half pixel samples from real samples there are some half
pixel samples that have not been generated. These samples have to be generated
from already generated samples. The sample j in figure 3.5 is an example of that.
To generate j the same FIR filter is used but with samples 1, 2, b, s, 7 and 8. j
could also be generated with samples 3, 4, h, m, 5 and 6. Note that unrounded
versions of the samples should be used when calculating j. When all half pixel
samples are generated it is time to generate the quarter pixel samples. This is
done by linear interpolation. Sample a in figure 3.5 is calculated as in equation
(3.6) and sample d is calculated as in equation (3.7). To generate the last samples
two diagonal half pixel samples are used, see equation (3.8). [10]
a = round((G + b)/2)   (3.6)

d = round((G + h)/2)   (3.7)

e = round((h + b)/2)   (3.8)
To enhance the video compression even more, H.264 has support for predicting macroblocks from more than one frame. This can be applied to both B and P coded slices. With the possibility to predict macroblocks from different frames, a much better video compression can be achieved. The downside of multiframe prediction is an increased cost in memory size, memory bandwidth and computational complexity. [10]
[Diagram of a current frame predicted from several previous frames and several following frames.]

Figure 3.6: Multiple frame prediction
To find the best motion vector the encoder uses a search algorithm such as Full Search (FS), Diamond Search or Hexagon Search. With Full Search a complete search of the whole search area is performed. This algorithm provides the best compression efficiency but is also the most time consuming. Diamond Search is a less time consuming search algorithm where the search pattern is formed as a diamond. Its performance in terms of compression is good in comparison with FS. Hexagon Search uses an even more refined search pattern where the search points are formed as a hexagon, see figure 3.7a. By decreasing the number of search points, the effort to calculate the motion vector is minimized and the result will be almost as good as with Diamond Search [16].
Motion estimation is the part of H.264 encoding that consumes the most computational power and is estimated to take about 60% to 80% of the total encoding time [15].
3.4.1 Hexagon Search
Hexagon search uses a 7 point search pattern which can be seen in figure 3.7a. Each cross in the grid represents a search point in the search area, where the grid resolution is one pixel. For each search point a Sum of Absolute Differences, equation (3.1), is calculated. [16]
Figure 3.7: Large (a) and small (b) search pattern in the hexagon search algorithm.
The search steps in the hexagon search are the following.

1. Calculate the SAD of the six closest search points and the current search point.

2. Make the search point with the smallest SAD the new current search point. If the middle point has the smallest SAD, jump to step 5.

3. Calculate the SAD of the 3 new search points that have not yet been calculated, as illustrated in figure 3.8.

4. Jump to step 2.

5. Calculate the SAD of the 4 new search points forming a diamond around the middle point. This is illustrated in figure 3.7b.

6. Choose the search point that resulted in the smallest SAD and form a motion vector to this search point.
When the smallest SAD is found the motion compensated residue can be calculated. This residue is then sent to the transformation part of the encoder for
further processing. In the decoder the motion vectors are used to restore the image
correctly from the residue that was sent from the encoder. [16]
[Diagram of successive hexagon pattern positions, numbered 1 to 5, moving across the search area before the switch to the small diamond pattern.]

Figure 3.8: Movement of the hexagon pattern in a search area and the change to the smaller search pattern.
3.5 Transform Coding and Quantization

The main transform used in H.264 is the discrete cosine transform.
3.5.1 Discrete Cosine Transform
The Discrete Cosine Transform (DCT) is a widely used transform in image and video compression algorithms. In H.264 the DCT decorrelates the residual data before quantization takes place. The DCT is a block based algorithm, which means it transforms one block at a time. In standards prior to H.264 the blocks were 8x8 pixels, but this has been changed to 4x4 samples to reduce the blocking effects that degrade the visual quality of the video. The DCT used in H.264 is a modified two-dimensional (2D) DCT transform. The transform matrix for the modified 2D DCT can be found in equation (3.9). [10]
C_f = \begin{pmatrix} 1 & 1 & 1 & 1 \\ 2 & 1 & -1 & -2 \\ 1 & -1 & -1 & 1 \\ 1 & -2 & 2 & -1 \end{pmatrix}   (3.9)

The 2D DCT transform in H.264 is given by equation (3.10)
Y = C_f X C_f^T \otimes E_f = \begin{pmatrix} 1 & 1 & 1 & 1 \\ 2 & 1 & -1 & -2 \\ 1 & -1 & -1 & 1 \\ 1 & -2 & 2 & -1 \end{pmatrix} X \begin{pmatrix} 1 & 2 & 1 & 1 \\ 1 & 1 & -1 & -2 \\ 1 & -1 & -1 & 2 \\ 1 & -2 & 1 & -1 \end{pmatrix} \otimes \begin{pmatrix} a^2 & \frac{ab}{2} & a^2 & \frac{ab}{2} \\ \frac{ab}{2} & \frac{b^2}{4} & \frac{ab}{2} & \frac{b^2}{4} \\ a^2 & \frac{ab}{2} & a^2 & \frac{ab}{2} \\ \frac{ab}{2} & \frac{b^2}{4} & \frac{ab}{2} & \frac{b^2}{4} \end{pmatrix}   (3.10)

where

a = \frac{1}{2}   (3.11)

b = \sqrt{\frac{2}{5}}   (3.12)
and X is the 4x4 block of pixels to calculate the DCT of. To simplify computation somewhat, the post-scaling (\otimes E_f) can be absorbed into the quantization process. [10] This will be described in more detail in section 3.5.3, which covers the quantization.

The modified 2D DCT is an approximation of the standard DCT. It does not give the same result, but the compression is almost identical. The advantage of this approximation is that the core equation C_f X C_f^T can be computed in 16 bit arithmetic with only shifts, additions and subtractions [6].
To do a two-dimensional DCT two one-dimensional DCTs can be performed
after each other, the first one on rows and the second one on columns or vice versa.
The function of the one-dimensional DCT can be seen in figure 3.9. [6]
[Butterfly diagram of the one-dimensional forward transform: the inputs x0 to x3 pass through additions, subtractions and multiplications by 2 to produce the outputs X0, X2, X1 and X3.]

Figure 3.9: DCT functional schematic
The operations performed while calculating the DCT as shown in figure 3.9
can be written as equation (3.13).
X_0 = (x_0 + x_3) + (x_1 + x_2)
X_2 = (x_0 + x_3) - (x_1 + x_2)
X_1 = 2(x_0 - x_3) + (x_1 - x_2)
X_3 = (x_0 - x_3) - 2(x_1 - x_2)   (3.13)
3.5.2 Inverse Discrete Cosine Transform
The transform that reverses the DCT is called the Inverse Discrete Cosine Transform (IDCT). With the design of the DCT in H.264 it is possible to ensure zero mismatch between different decoders, because the DCT and the IDCT (3.14) can be calculated in integer arithmetic. In the standard DCT some mismatch can occur, caused by different representations and precisions of fractional numbers in the encoder and the decoder. [10]
The 2D IDCT transform in H.264 is given by
X_r = C_i^T (Y \otimes E_i) C_i = \begin{pmatrix} 1 & 1 & 1 & \frac{1}{2} \\ 1 & \frac{1}{2} & -1 & -1 \\ 1 & -\frac{1}{2} & -1 & 1 \\ 1 & -1 & 1 & -\frac{1}{2} \end{pmatrix} \left( Y \otimes \begin{pmatrix} a^2 & ab & a^2 & ab \\ ab & b^2 & ab & b^2 \\ a^2 & ab & a^2 & ab \\ ab & b^2 & ab & b^2 \end{pmatrix} \right) \begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & \frac{1}{2} & -\frac{1}{2} & -1 \\ 1 & -1 & -1 & 1 \\ \frac{1}{2} & -1 & 1 & -\frac{1}{2} \end{pmatrix}   (3.14)
where Xr is the reconstructed original block and Y is the previously transformed
block. As with the DCT the pre-scaling (⊗Ei ) can be absorbed into the rescaling
process. [10] This will be described in more detail in section 3.5.4 which covers
the rescaling.
[Butterfly diagram of the one-dimensional inverse transform: the inputs X0 to X3 pass through additions, subtractions and multiplications by 1/2 to produce the outputs x0 to x3.]

Figure 3.10: IDCT functional schematic
The function of the IDCT can be seen in figure 3.10. To do a two-dimensional
IDCT two one-dimensional IDCTs are performed after each other, the first one on
rows and the second one on columns or vice versa. [6] The operations performed
while calculating the IDCT can be written as equation (3.15).
x_0 = (X_0 + X_2) + (X_1 + \frac{1}{2}X_3)
x_1 = (X_0 - X_2) + (\frac{1}{2}X_1 - X_3)
x_2 = (X_0 - X_2) - (\frac{1}{2}X_1 - X_3)
x_3 = (X_0 + X_2) - (X_1 + \frac{1}{2}X_3)   (3.15)
3.5.3 Quantization
Information is often concentrated in the lower frequency area, and therefore quantization can be used to further compress the data after applying the DCT. H.264 uses a parameter in the quantization called the Quantization Parameter (QP). The QP describes how much quantization should be applied, i.e. how much data should be truncated. A total of 52 values ranging from 0 to 51 are supported by the H.264 standard. Using a high QP decreases the size of the coded data, but also decreases the visual quality of the coded video. With QP = 0 the quantization is minimal and nearly all data is kept. [10]
From QP the quantizer step size (Qstep) can be derived. The first values of Qstep are presented in table 3.1. Note that Qstep doubles in value for every increase of 6 in QP. The large number of step sizes makes it possible to accurately control the trade-off between bitrate and quality in the encoder. [10]
QP     0      1       2       3      4  5      6     7      8      ...
Qstep  0.625  0.6875  0.8125  0.875  1  1.125  1.25  1.375  1.625  ...

Table 3.1: Qstep for a few different values of QP
The basic formula for quantization can be written as

Z_{ij} = round\left(\frac{Y_{ij}}{Q_{step}}\right)   (3.16)
where Yij is a coefficient of the previously transformed block to be quantized and
Zij is a coefficient of the quantized block. The rounding operation does not have
to be to the nearest integer, it could be biased towards smaller integers which
could give perceptually higher quality. This is true for all rounding operations in
the quantization. [10]
As mentioned in section 3.5.1 the quantization can absorb the post-scaling
(⊗Ef ) from the DCT. The unscaled output from the DCT can then be written
as W = Cf XCfT (as compared to the scaled output which is Y = Cf XCfT ⊗ Ef ).
[10] This gives
Z_{ij} = round\left(W_{ij} \cdot \frac{PF_{ij}}{Q_{step}}\right)   (3.17)
where W_{ij} is a coefficient of the unscaled transformed block, Z_{ij} is a coefficient of the quantized block and PF_{ij} is either a^2, \frac{ab}{2} or \frac{b^2}{4} for each (i, j) according to

PF = \begin{pmatrix} a^2 & \frac{ab}{2} & a^2 & \frac{ab}{2} \\ \frac{ab}{2} & \frac{b^2}{4} & \frac{ab}{2} & \frac{b^2}{4} \\ a^2 & \frac{ab}{2} & a^2 & \frac{ab}{2} \\ \frac{ab}{2} & \frac{b^2}{4} & \frac{ab}{2} & \frac{b^2}{4} \end{pmatrix}   (3.18)

where a and b are the same as in equation (3.10) in section 3.5.1. [10]
PF and Qstep can then be reformulated using a multiplication factor (MF) and a division. MF is in fact a 4 × 4 matrix of multiplication factors according to

MF = \begin{pmatrix} A & C & A & C \\ C & B & C & B \\ A & C & A & C \\ C & B & C & B \end{pmatrix}   (3.19)
where the values of A, B and C depend on QP according to

QP    A      B     C
0     13107  5243  8066
1     11916  4660  7490
2     10082  4194  6554
3     9362   3647  5825
4     8192   3355  5243
5     7282   2893  4559

Table 3.2: Multiplication factor MF
The scaling factors in MF are repeated for every increase of 6 in QP. The
reformulation of PF and Qstep then becomes
\frac{PF}{Q_{step}} = \frac{MF}{2^{qbits}}   (3.20)

where qbits is calculated as

qbits = 15 + floor\left(\frac{QP}{6}\right)   (3.21)

This gives a new quantization formula according to

Z_{ij} = round\left(W_{ij} \cdot \frac{MF_{ij}}{2^{qbits}}\right)   (3.22)
which is the final form. [10]
3.5.4 Rescaling
The rescaling also uses Qstep which depends on the Quantization Parameter (QP)
and is the same as for quantization (see table 3.1). The basic formula for rescaling
can be written as
Y'_{ij} = Z_{ij} Q_{step}   (3.23)
where Z_{ij} is a coefficient of the previously quantized block and Y'_{ij} is a coefficient of the rescaled block. The rounding operation, as in the quantizer, does not have to be to the nearest integer; it could be biased towards smaller integers, which
could give perceptually higher quality. This is true for all rounding operations in
the rescaling. [10]
As the quantization formula was reformulated, the rescaling formula can also absorb the pre-scaling (\otimes E_i) and be reformulated to match the quantization formula. The new formula for rescaling, where the pre-scaling factor is included, can be written as

W'_{ij} = Z_{ij} Q_{step} PF_{ij} \cdot 64   (3.24)
where PF_{ij} is the same as in (3.18), Z_{ij} is a coefficient of the previously quantized block, W'_{ij} is a coefficient of the rescaled block and the constant scaling factor of 64 is included to avoid rounding errors while calculating the inverse DCT. [10]
Much like MF for the quantization, the rescaling also uses a 4 × 4 matrix of scaling factors called V, which also incorporates the constant scaling factor of 64 introduced in (3.24). V can be written as

V = \begin{pmatrix} A & C & A & C \\ C & B & C & B \\ A & C & A & C \\ C & B & C & B \end{pmatrix}   (3.25)
where the values of A, B and C depend on QP according to

QP    A   B   C
0     10  16  13
1     11  18  14
2     13  20  16
3     14  23  18
4     16  25  20
5     18  29  23

Table 3.3: Scaling factor V
The scaling factors in V are, like MF, repeated for every increase of 6 in QP. With V the rescaling formula can be written as

W'_{ij} = Z_{ij} V_{ij} \cdot 2^{floor(QP/6)}   (3.26)
which is the final form. [10]
3.6 Deblocking Filter
When using block coding algorithms such as the DCT, blocking artifacts can occur. This is unwanted because it lowers the visual quality and the prediction performance. The solution is to add a filter that removes these artifacts. The filter is placed after the IDCT in the encoding loop, which can be seen in figure 3.1. The filter is used on both luma and chroma samples of the video sequence. [10]
[Diagram of the filtering order within a macroblock: the luminance edges A to H in (a) and the chrominance edges 1 to 4 in (b).]

Figure 3.11: Filtering order of a 16x16 pixel macroblock with start in A and end in H for luminance (a) and start in 1 and end in 4 for chrominance (b)
The deblocking filter in H.264 has 5 levels of filtering, 0 to 4, where 4 is the option with the strongest filtering. The filter actually consists of two different filters, where the first filter is applied on levels 1 to 3 and the second on level 4. Level 0 means that no filter should be applied. The filter level parameter is called boundary strength (bS). The parameter depends on the current quantization parameter, the macroblock type and the gradient of the image samples across the boundary. There is one bS for every boundary between two 4x4 pixel blocks. The deblocking filter is applied to one macroblock at a time in raster scan order throughout the frame. [5]
[Diagram of eight pixels p3, p2, p1, p0, q0, q1, q2, q3 straddling a vertical or a horizontal block boundary, with the boundary between p0 and q0.]

Figure 3.12: Pixels in blocks adjacent to vertical and horizontal boundaries
The deblocking filter is applied to a macroblock in a special order, which is illustrated in figure 3.11. The filter is applied on vertical and horizontal edges as shown in figure 3.12, where p0, p1, p2, p3, q0, q1, q2, q3 are pixels from two neighboring blocks, p and q. The filtering of these pixels only takes place if equations (3.27), (3.28) and (3.29) are fulfilled.
|p_0 - q_0| < \alpha(index_A)   (3.27)

|p_1 - p_0| < \beta(index_B)   (3.28)

|q_1 - q_0| < \beta(index_B)   (3.29)

index_A = Min(Max(0, QP + Offset_A), 51)   (3.30)

index_B = Min(Max(0, QP + Offset_B), 51)   (3.31)
The values of \alpha and \beta are approximately defined by equations (3.32) and (3.33).

\alpha(x) = 0.8(2^{x/6} - 1)   (3.32)

\beta(x) = 0.5x - 7   (3.33)
Note that equations (3.30) and (3.31) show that the filtering depends on the Quantization Parameter. The different filters applied are 3-, 4- and 5-tap FIR filters, which are further described in [5].
3.7 Entropy Coding
The H.264 standard supports two different entropy coding algorithms, Context-based Adaptive Variable Length Coding (CAVLC) and Context-based Adaptive Binary Arithmetic Coding (CABAC). CABAC is the more efficient of the two, but requires higher computational complexity. The bitrate savings of CABAC can be between 9% and 14% compared to CAVLC [7]. CAVLC is supported in all H.264 profiles, but CABAC is only supported in the profiles above Extended. [10]
Chapter 4
Overview of the ePUMA Architecture
This chapter gives an introduction to the ePUMA processor architecture. The memory hierarchy, the master core, the Sleipnir cores, the direct memory access controller and the simulator will be covered.
4.1 Introduction to ePUMA
Embedded Parallel Digital Signal Processing Processor Architecture with Unique Memory Access (ePUMA) is a multi-core DSP processor architecture with 1 master core and 8 calculation cores. The master core handles the Direct Memory Access (DMA) communication. The slave cores, also called Sleipnir cores, are 15-stage pipelined calculation cores.
4.2 ePUMA Memory Hierarchy
The ePUMA memory hierarchy consists of three levels, where the first level is the off-chip main memory, the second level is the local storage of the master and slaves, and the third and final level is the registers of the master and slave cores. Figure 4.1 illustrates how each core is connected to the on-chip interconnection, which in turn is connected to the off-chip main memory. The main memory is addressed with a high word of 16 bits and a low word of another 16 bits, which means that 32-bit addressing is used where each address corresponds to one word of data.
[Diagram of the memory hierarchy: the off-chip main memory (level 1) is connected through the on-chip interconnection to the local stores of the master core (PM, DM 0, DM 1) and of each Sleipnir core (PM, CM, LVM 1-3) (level 2), with the registers of each core forming level 3.]

Figure 4.1: ePUMA memory hierarchy
The on-chip network is depicted in figure 4.2, where N0 to N7 are interconnection nodes. As can be seen from the figure, the nodes are connected both to the master and to the respective Sleipnir core, but also to other nodes. This makes it possible to transfer data between Sleipnir cores and even to pipeline the cores. With this setup, data can be transferred in any way and combination that does not overlap.
[Diagram of the star network: the master, the DMA controller and the main memory sit at the center, connected to the nodes N0 to N7, where each node is attached to one Sleipnir core and to its neighboring nodes.]

Figure 4.2: ePUMA star network interconnection
4.3 Master Core
The master core is for the moment based on a processor called Senior. This processor has been used at the Division of Computer Engineering for some years and is used in some courses for educational purposes. The Senior processor is a DSP processor, which means it has a Multiply and ACcumulate (MAC) unit and other DSP related capabilities. To make it able to serve as a master core, memory ports for the DMA controller and interrupts from the DMA and the Sleipnir cores have been added.
4.3.1 Master Memory Architecture
The master core has 2 RAMs and 2 ROMs, called Data Memory 0 (DM 0) and Data Memory 1 (DM 1). These memories are the local storage of the master core. The ROMs start at address 0x8000 in the respective memory, which gives 0x7FFF = 32767 words in each RAM to work with.
For calculations the master core has 32 16-bit registers that can be used as buffers. There are also a number of special registers, such as 4 address registers, registers for hardware looping and registers supporting cyclic addressing in address registers 0 and 1. Address registers 0 and 1 also support different step sizes.
4.3.2 Master Instruction Set
The programming guide and instruction set for Senior can be found in [9] and [8], even though they might not be totally accurate because of the modifications for the ePUMA project. The master's instruction set is largely the same as the Senior instruction set. It is a standard DSP instruction set with support for a convolution instruction which multiplies and accumulates the results. To speed up looping, a hardware loop function called repeat is included. All jumps, calls and returns can use 0 to 3 delay slots. The number of delay slots specifies how many instructions after the flow control instruction will be executed. If not all delay slots are used for useful instructions, nop instructions will be inserted in the pipeline.
4.3.3 Datapath
The datapath of the master consists of a 5-stage pipeline, which can be seen in figure 4.3. There is only one exception to this: the convolution instruction (conv) uses a 7-stage pipeline, but a figure of this is omitted for lack of relevance. The datapath is advanced enough for scalar calculations; larger computational loads should be delegated to the Sleipnir cores. In table 4.1, originally found in [9], a description of the pipeline stages is presented.
[Diagram of the Senior datapath: next PC and PM fetch (P1), decoder (P2), register file with operand select and AGU (P3), ALU with flags (P4), and MAC, the data memories DM 0 and DM 1 and the condition check (P5).]

Figure 4.3: Senior datapath for short instructions
Pipe  RISC-E1/E2                   RISC memory load/store
P1    IF: Instr. Fetch             IF: Instr. Fetch
P2    ID: Instr. Decode            ID: Instr. Decode
P3    OF: Operand Fetch            OF+AG: Compute addr
P4    EX1: Execution (set flags)   MEM: Read/Write
P5    EX2: Only for MAC, RWB       WB: Write back (if load)

Table 4.1: Pipeline specification
4.4 Sleipnir Core
Sleipnir is the name of the calculation core, of which the ePUMA processor has 8. Sleipnir is a Single Instruction Multiple Data (SIMD) architecture, which in this case means it can perform vector calculations. Each full vector consists of 128 bits and is divided into 8 words of 16 bits, which can run through the pipeline in parallel. The datapath of the Sleipnir core has 15 pipeline stages. The pipeline length of an instruction is variable depending on the choice of operands.
4.4.1 Sleipnir Memory Architecture
The Sleipnir core has 3 memories, where 2 of them are connected to the core and the third is connected to the DMA bus. The memories are called Local Vector Memories (LVMs). By being able to swap which memories are connected to the core and which memory is connected to the DMA, better utilization can be reached and much of the transfer cycle cost can be hidden.
Constant Memory

Each Sleipnir is also provided with a Constant Memory (CM) for constants used during runtime, such as scalar constants or permutation vectors. All constants that will be used during runtime can be stored in the CM. The memory can contain up to 256 vectors.
Local Vector Memory
The Local Vector Memories (LVM) are the local memories of the Sleipnir core. As
described above each core has access to 2 LVMs at runtime. These memories are
4096 vectors large, where each vector is 128 bits wide. The memories have one
address for each word of 16 bits. The memories consist of 8 memory banks, one
for each word in a vector. The constant memory can be used to address the LVMs according to the values stored in it. This can be used to generate a permutation of the data, e.g. for transposing a matrix.
Vector Register File
There are 8 Vector Registers (VR) in the Vector Register File (VRF), VR0 to VR7, for use in computations during runtime. Each word can be accessed separately; it is also possible to access a double word as well as the high or low half vector in each of the 8 vector registers. The different access types are listed in table 4.2, originally found in [4].
Syntax    Size     Description
vrX.Y     16-bit   Word
vrX.Yd    32-bit   Double word
vrX{h,l}  64-bit   Half vector
vrX       128-bit  Vector

Table 4.2: Register file access types
Special Registers

There are 4 address registers, ar0-ar3, which can be used to address memory in the LVMs. There are also 4 configuration registers for these address registers. They hold values for top, bottom and step size, which can be used when addressing memories in all kinds of loops. The different increment
operations are listed in table 4.3, originally found in [4].
arX+=C    Fixed increment; C = 1, 2, 4 or 8
arX-=C    Fixed decrement; C = 1, 2, 4 or 8
arX+=S    Increment from stepX register
arX+=C%   Fixed increment with cyclic addressing
arX-=C%   Fixed decrement with cyclic addressing
arX+=%    Increment from stepX with cyclic addressing

Table 4.3: Address register increment operations
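A minimal Python model of these increment modes follows, under the assumption that cyclic addressing wraps the register into the [bottom, top] range taken from the configuration register:

```python
def step_address(ar, delta, cyclic=False, bottom=0, top=0):
    """Advance an address register by delta; wrap into [bottom, top] if cyclic.

    delta models C (fixed) or the stepX register value; the wrap rule is an
    assumption, the thesis does not spell out the exact hardware behavior.
    """
    ar += delta
    if cyclic:
        size = top - bottom + 1
        ar = bottom + (ar - bottom) % size
    return ar
```

For example, with a circular buffer occupying addresses 8 to 15, stepping past the top wraps back toward the bottom.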
The addressing of the two LVMs can be done with one of the four address
registers, immediate addresses, vector registers or in combination with the constant
memory, to form advanced addressing schemes as shown in table 4.4, originally
found in [4].
Mode#  Index  Offset  Pattern            Syntax example
0      arX    0       0,1,2,3,4,5,6,7    [ar0]
1      arX    0       cm[carX]           [ar0 + cm[car0]]
2      arX    0       cm[imm8]           [ar0 + cm[10]]
3      arX    0       cm[carX + imm8]    [ar0 + cm[car0 + 10]]
4      0      vrX.Y   0,1,2,3,4,5,6,7    [vr0.0]
5      0      vrX.Y   cm[carX]           [vr0.0 + cm[car0]]
6      0      vrX.Y   cm[imm8]           [vr0.0 + cm[10]]
7      0      vrX.Y   cm[carX + imm8]    [vr0.0 + cm[car0 + 10]]
8      0      0       vrX                [vr0]
9      0      0       cm[carX]           [cm[car0]]
10     0      0       cm[imm8]           [cm[10]]
11     0      0       cm[carX + imm8]    [cm[car0 + 10]]
12     arX    0       vrX                [ar0 + vr0]
13     arX    vrX.Y   0,1,2,3,4,5,6,7    [ar0 + vr0.0]
14     arX    imm16   0,1,2,3,4,5,6,7    [ar0 + 1024]
15     0      imm16   0,1,2,3,4,5,6,7    [1024]

Table 4.4: Addressing modes examples
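The composition in table 4.4 can be sketched as index plus offset plus pattern, giving one word address per lane. The helper below is purely illustrative:

```python
def effective_addresses(index=0, offset=0, pattern=None):
    """Return the 8 word addresses that one vector access touches.

    index models arX (or 0), offset models vrX.Y or imm16 (or 0), and
    pattern models the per-lane component: the default 0,1,2,3,4,5,6,7,
    a CM permutation entry, or a vector register.
    """
    if pattern is None:
        pattern = list(range(8))
    return [index + offset + p for p in pattern]

# Mode 0, [ar0]: a sequential vector at the address register.
mode0 = effective_addresses(index=16)
# Mode 1 style, [ar0 + cm[car0]]: a stride-8 permutation pattern from the CM.
mode1 = effective_addresses(index=32, pattern=[0, 8, 16, 24, 32, 40, 48, 56])
```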
Program Memory
The program memory (PM) can contain up to 512 instructions. It can be loaded
from the main memory by issuing a DMA transaction.
The program that is loaded into the Sleipnir PM is called a block. A kernel
is a combination of master code and blocks. A block can utilize several Sleipnir
cores with internal data transfers. Blocks can however not communicate with cores outside the block and cannot be data dependent on any other block running at the same time.
If the Sleipnir block code is for some reason larger than 512 instructions, it can be divided into two programs and the data can be transferred between two Sleipnir cores. For this to work, code is needed in the master to keep track of the cores and move data to the next core for further processing. When developing a new block or kernel it can sometimes be useful to have a little extra memory; therefore it is possible to increase the size of the PM in the simulator.
4.4.2 Datapath
The datapath of the Sleipnir slave core is an 8-way 16-bit datapath. It is divided into 15 pipeline stages and is depicted in figure 4.4. A more detailed version of the datapath can be found in [2].
[Figure 4.4: Sleipnir datapath pipeline schematic. Stages A1-A2: instruction fetch and decode; B1-B4: CM addressing, LVM scalar and vector addressing, VRF and SPRF; C1: operand selection and formatting; D1-D4: multiplier, ALU 1 and ALU 2; E1-E4: write back.]
The datapath includes 16 16x16-bit multipliers and two Arithmetic Logic Units (ALU) connected in series. Simpler instructions can bypass the first ALU and thereby become shorter instructions, which saves some execution time. These bypasses can be seen in stages D1 to D4 in figure 4.4. Some instructions use a very short datapath, such as the jump instruction, which is executed in stage A2. This makes the use of precalculated branch decisions unnecessary. Stages E1 to E4 can be described as the write back stage and therefore follow after stage D4. Stages D3 and D4 are very similar but provide the core with the possibility of performing summation of a complete vector and similar tasks.
4.4.3 Sleipnir Instruction Set
The instruction set used is application specific. It includes no move or load instructions for data; these functions are all covered by one instruction called copy. Operands and instructions can be combined in different ways, with variable pipeline length as a result. The pipeline length depends on e.g. where the input operands are fetched from, where the result will be stored and whether the instruction uses or bypasses the first ALU and the multipliers. Instruction names are built upon what data they affect and how. For example, the instruction vcopy m0[0].vw m1[0].vw copies a vector from memory 1, address 0, to memory 0, address 0. If the instruction scopy were used instead, it would only copy a scalar word. Another example is the add instruction. If vaddw m0[0].vw m1[0].vw vr0 is used, one vector is loaded from m1 and one from vr0. The .vw after the memory address denotes that the vectors will be added word wise, meaning they are treated as eight words. The processor can thus carry out 8 additions per clock cycle. [4]
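The word-wise semantics of vaddw can be sketched in Python as eight independent 16-bit lane additions; wrap-around on 16-bit overflow is an assumption made here, the saturation behavior is not specified above:

```python
MASK16 = 0xFFFF

def vaddw(dst_mem, dst_addr, src_mem, src_addr, vreg):
    """Model of vaddw m0[a].vw m1[b].vw vr0: add one 8-word vector from
    memory to one from a vector register, word wise, result to memory.
    16-bit wrap-around on overflow is an assumption."""
    for lane in range(8):
        s = src_mem[src_addr + lane] + vreg[lane]
        dst_mem[dst_addr + lane] = s & MASK16

m0 = [0] * 8
m1 = [100, 200, 300, 400, 500, 600, 700, 0xFFFF]
vr0 = [1] * 8
vaddw(m0, 0, m1, 0, vr0)
```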
4.4.4 Complex Instructions
To reach better performance, the datapath has to be utilized as much as possible, especially in the inner loops of the critical path. To achieve this, new specialized instructions that perform several smaller tasks could be implemented. By pipelining several of these new complex instructions, more work can be done in less time and the program reaches an increased throughput.
The considerations made when deciding whether to accelerate certain parts of the code are listed below.
• Motivation – Why should the acceleration be done
• Description – What is going to be accelerated
• Extra hardware needed – What extra hardware is needed for acceleration of
the specific task
• Profiling and usage – Is the task used a lot and therefore worth accelerating
• Extra hardware cost – What is the cost of the extra hardware
• Cycle gain – How many cycles can be saved
• Efficiency – How efficient is the new solution in terms of cost per gain in
performance
4.5 DMA Controller
The Direct Memory Access (DMA) controller is used to load and store data to and from an off-chip memory. The DMA can transfer a 128-bit vector to one of the Sleipnirs every cycle. It can also broadcast data to one or more Sleipnirs. If a block is to be loaded into two or more cores it can be broadcast, so that no cycles are lost by loading the block separately into each core. This saves time both because of the time it takes to copy the data and because it takes some cycles to configure and start the DMA transaction.
[Figure 4.5: Sleipnir Local Store switch. The DMA (via the NoC) and the Sleipnir core connect through two switches to the local store: the PM, the CM and the three LVMs.]
As mentioned before, there are 3 memories belonging to each core. There are 6 different setups for how the memories are connected to the DMA and the core. The switch which controls this is illustrated in figure 4.5, originally described in [2]. There is also a switch for selecting whether LVM, PM or CM should be connected to the DMA. This switch is changed accordingly when programming the Sleipnir core's PM, CM or LVM. To initiate a DMA transaction, the DMA unit needs to be configured. This configuration includes start addresses in both memories, the number of vectors to be transferred, how to access data, step size in memory, switch configuration and broadcast configuration. The DMA has support for 2D accesses in main memory, which is helpful when advanced access patterns are used. When the configuration of the DMA unit is done, the task can be started.
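The 2D access support can be sketched as gathering a rectangular block out of a row-major frame; the parameter names below are illustrative, not the real DMA configuration fields:

```python
def dma_2d_read(mem, start, row_len, num_rows, stride):
    """Gather num_rows runs of row_len words, placed stride words apart.

    This models a 2D DMA access: e.g. cutting a macroblock and its search
    area out of a frame stored row-major in main memory.
    """
    out = []
    for r in range(num_rows):
        base = start + r * stride
        out.extend(mem[base : base + row_len])
    return out

frame = list(range(64))   # an 8x8 "frame" with stride 8
block = dma_2d_read(frame, start=9, row_len=2, num_rows=2, stride=8)
```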
4.6 Simulator
The ePUMA architecture has a full system simulator available. The simulator is bit- and pipeline-true. Simulations can be done either on one standalone Sleipnir core or on the full system simulator, where the master core and all 8 Sleipnir cores are included.
The simulator can be invoked by a Python script. The simulator has a number of functions that can be used to access LVM contents, address registers, vector registers, the program counter and the instruction that is being executed [3]. This can be used both for debugging and profiling. Interrupts from the DMA and the Sleipnir cores can also be caught by the same Python script. Opportunities like pre-processing of the input and post-processing of the results directly in the Python script are also available.
The simulator has support for different modes of simulation, either simulation
until an event occurs or simulation of one cycle at a time. When simulating until
an event happens, these events need to be enabled in the simulator. Events that
are possible to enable are starting and stopping of a specific Sleipnir core, memory
out of range and data hazards such as read before write. Simulating one cycle at
a time offers more opportunities to evaluate each step in an execution.
To get the simulator to carry out a full system simulation it needs input such as
master code, Sleipnir code and the data that is going to be used during runtime.
These are possible to add before the simulation begins. Allocating memory for
results in main memory is also possible.
Chapter 5
Elaboration of Objectives
This chapter gives a more detailed task specification by using knowledge acquired from the previous theoretical chapters. It also covers the method used and the procedure taken.
5.1 Task Specification
The main task at hand is to evaluate a new processor architecture and how capable it is with respect to H.264 video encoding, using the available system simulator. The evaluation will be done by developing selected parts of an H.264 encoder. Most weight should be put on evaluating the more computationally intensive parts, which are likely to constitute the bottleneck of the encoding.
To implement an encoder, or parts of it, that uses the H.264 standard, a thorough understanding of both the H.264 standard's core parts and the ePUMA processor architecture, tool chain and instruction set is needed. This information and understanding has to be acquired first.
The video focused upon will be 1080p full high-definition (HD) video at a rate
of 30 frames per second (FPS) using the 4:2:0 sampling format encoding which
was described in section 2.2. The video frames that calculations will be performed
on are presumed to be stored with 8 bits per pixel in the main memory.
Once performance results have been acquired, possible areas of improvement
can be exposed. Different ways of improvement can then be compared, both in
terms of performance improvement and the estimated extra hardware needed, to
give a measurement of efficiency. The results will also be compared to the results
from the H.264 encoder for the STI Cell architecture which are presented in [15].
Other parts that will be evaluated include the Discrete Cosine Transform
(DCT), Inverse Discrete Cosine Transform (IDCT), Quantization and Rescaling.
5.1.1 Questions at Issue
The following questions were derived from the purpose in section 1.2 and the task
specification.
• Is it possible to perform real-time full HD video encoding at 30 FPS using
the H.264 standard in the ePUMA processor?
• Would it be possible to modify the processor architecture to reach better
performance and if so, would it be worth the cost of the potentially added
hardware?
• What are the cycle costs compared to the STI Cell H.264 encoder?
5.2 Method
The main method used to conduct the work has been to use the ePUMA system simulator. The simulator was invoked from a script written in the Python programming language. This gave enough flexibility both to enable measurement of all the results and to make testing automatic within the same script.
If the sole purpose of this thesis had been to give performance measurements, other methods might have been candidates, such as hand calculations of cycle costs. As the purpose of this thesis also includes functional implementations, using the simulator is the choice that offers the best validity if used correctly.
If the implementation of the simulator is correct according to the proposed architecture, it will give measurements of high reliability. This is based on the fact that the simulator is pipeline-true as well as cycle- and bit-correct.
5.3 Procedure
The procedure taken while working on this thesis started with a study of video coding, the H.264 standard and the ePUMA processor architecture. Once the required information had been acquired, a functional stand-alone Sleipnir block for motion estimation was developed. When the block was found to be correct and working, code for the master was developed so that the master could run the motion estimation using the Sleipnir block. From this point the motion estimation kernel was developed with various stepwise improvements. The master code was also developed to be able to run the different versions of the kernel using a variable number of slave cores. Then the construction of the other Sleipnir blocks, such as DCT, IDCT, Quantization and Rescaling, began. Once all blocks were implemented, performance measurements could be acquired and the results could be analyzed to give conclusions and answer the questions at issue.
Chapter 6
Implementation
This chapter covers how the implementation of the different kernels and blocks was done, how they evolved, and the different decisions that were made and why.
6.1 Motion Estimation
Motion estimation was found to be the prime target for performance evaluation as it, in nearly all cases, takes up the majority of the encoding cycle time. All implementations of motion estimation are done for a frame 65 macroblocks high and 118 macroblocks wide. This simplifies the implementation, as the corners and sides of a frame constitute special cases. The number of macroblocks left out by this simplification is 430, compared to the total number of 120 ∗ 67.5 = 8100 macroblocks for a full HD frame. This corresponds to 5.31% and still leaves 7670 macroblocks to perform calculations on. The search area was chosen as (−15, 15) × (−15, 15), according to what was used in [15], to yield as comparable results as possible. Another simplification of the motion estimation is that it is only performed on entire macroblocks; no further division into e.g. 16 × 8, 8 × 8 or 4 × 4 pixels is performed. The reason for this is that it might not be feasible to perform these calculations on a low-power architecture such as ePUMA without increasing the clock frequency. Doing so would be counterproductive from a low-power point of view and, even if it were doable, might not be applicable to hand-held devices running on batteries.
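The macroblock counts above can be verified with a few lines of arithmetic:

```python
# A 1920x1080 frame holds 1920/16 = 120 columns and 1080/16 = 67.5 rows of
# 16x16 macroblocks; the simplified frame is 118 x 65 macroblocks.
full_mbs = (1920 // 16) * 1080 / 16   # 120 * 67.5 = 8100.0 macroblocks
used_mbs = 118 * 65                   # macroblocks actually computed
left_out = full_mbs - used_mbs
share_pct = round(100 * left_out / full_mbs, 2)
```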
6.1.1 Motion Estimation Reference
In order to evaluate the results produced by the motion estimation kernels, a reference motion estimation program was written in the Python scripting language. By comparing the resulting motion vectors and costs in an automated fashion, the functionality of the kernels could be verified with little effort.
6.1.2 Complex Instructions
The function of the innermost loop of motion estimation can be described as
follows:
1. 16 ∗ 16 = 256 subtractions of 8-bit unsigned numbers.
2. Calculate the absolute value of each subtraction result.
3. Sum all absolute values together to one final sum.
This gives a total of 256 subtraction operations (SUB), 256 absolute value (ABS)
operations and 255 addition operations (ADD) which is equal to 767 operations.
This theoretically corresponds to 32 vector word SUB instructions, 32 vector word
ABS instructions and 37 vector word SUM instructions in the Sleipnir core. Theoretically those instructions would need a total of 32 + 32 + 37 = 101 cycles in the
Sleipnir core.
By examining the Sleipnir datapath, as can be seen in figure 4.4, it can be found
that several of the necessary operations could be done in series in a pipelined
fashion. By exploiting this a new complex instruction could be constructed as
mentioned in section 4.4.4.
In addition, by having the operand selection and operand formatting parts
of the pipeline able to fetch 8-bit unsigned numbers from the operands and feed
them to the datapath as 16-bit unsigned numbers, a further reduction of cycle
time could be achieved. By utilizing the datapath to this extent, the complex instruction produces two scalar words as a partially summed result. This means another 9 vector word SUM instructions will still be needed, which gives a total theoretical computation time of 32 + 9 = 41 cycles for calculating one macroblock sum of absolute differences (MB SAD).
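The operation and cycle counts above can be restated as a quick check:

```python
# One 16x16 MB SAD needs 256 SUBs, 256 ABSs and 255 ADDs. With the plain
# instruction set this maps to 32 + 32 + 37 vector instructions; with the
# proposed complex instruction, 32 passes plus 9 remaining vector SUMs.
ops = 256 + 256 + 255            # scalar operations per MB SAD
plain_cycles = 32 + 32 + 37      # vector SUB, ABS and SUM instructions
complex_cycles = 32 + 9          # complex-instruction passes + final SUMs
saved = plain_cycles - complex_cycles
```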
By studying the hexagon search algorithm (section 3.4.1) it can be seen that
the algorithm will need to calculate the sum of absolute difference between two
macroblocks (MB SAD) a number of times equal to 7 + 3 ∗ n + 4, where n is the
number of steps taken. It is also known that there will be 8100 macroblocks to
perform a hexagon search upon in each frame. A summary of the considerations
taken, as mentioned in section 4.4.4, is listed below.
• Motivation – Innermost loop of motion estimation.
• Description – Perform absolute difference and partial sum.
• Extra hardware needed – None, or possibly operand 8-bit selection.
• Profiling and usage – Used (7 + 3 ∗ n + 4) ∗ 8100 times per frame.
• Extra hardware cost – None, or affordable.
• Cycle gains – Theoretically 101 − 41 = 60 for each MB SAD.
• Efficiency, gain per cost – Very high.
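The per-frame usage figure in the list above follows directly from the cost formula:

```python
def sads_per_frame(steps, macroblocks=8100):
    """MB SAD evaluations per frame: the hexagon search computes
    7 + 3n + 4 SADs for n steps, for each of the frame's macroblocks."""
    return (7 + 3 * steps + 4) * macroblocks
```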
The analysis of the innermost loop of the motion estimation leads to the conclusion that complex instructions could give a real boost to performance. The new instructions specific to motion estimation were named HVBSUMABSDWA and HVBSUMABSDNA, which can be read out as “Half Vector Bytewise SUM of ABSolute Differences Word Aligned” and “Half Vector Bytewise SUM of ABSolute Differences Not word Aligned” respectively. The proposed hardware setup of the datapath for these two instructions is depicted in Appendix A.1 and A.2. As can be seen from the figures, the instructions do not use the full width of the operands, because the data is stored bytewise in the memory. The datapath will still be fully utilized, as the 8-bit input pixel values are promoted to 16-bit values in the 8 computation lanes of the Sleipnir datapath.
In addition to the motion estimation instructions, two more instructions can follow without any essential additional cost. These instructions were named HVBSUBWA and HVBSUBNA, which can be read out as “Half Vector Bytewise SUBtraction Word Aligned” and “Half Vector Bytewise SUBtraction Not word Aligned” respectively. These instructions are used in motion compensation, as the subtraction results have to be kept intact to produce the residue frame. The implementation of these instructions can be seen in Appendix A.3 and A.4.
6.1.3 Sleipnir Blocks
The hexagon search Sleipnir block was the first part to be implemented, at first only focusing on performing calculations on one macroblock at a time. The input needed to perform calculations is one macroblock from the new frame and a larger chunk of data from the previous frame, the reference, which constitutes the search area. All motion estimation blocks are divided into smaller functions, and the program flowchart is depicted in figure 6.1.
When execution starts, the program calculates the Sum of Absolute Differences (SAD) for the first 7 search points: MID, LEFT, RIGHT, UP LEFT, UP RIGHT, DOWN LEFT and DOWN RIGHT, as shown in figure 6.1. Once the first 7 SAD costs have been calculated, the program reaches the main loop, where the MIN function determines which cost is lowest and moves on to one of the 7 corresponding MIN functions.
The 6 directional MIN functions (MIN LEFT, MIN RIGHT, MIN UP LEFT, ...) update the motion vectors and data addresses and then move on to the corresponding 3 new search points. Once the SAD costs of these 3 new search points have been calculated, the MIN function is again used to find the new minimum cost and the loop continues.
The MIN MID state is reached if the middle point was found to be the search point with the lowest SAD cost. When this happens, Phase 2 (P2) of the algorithm starts, which means that the search pattern is changed to the small hexagon (figure 3.7). Once the final 4 search points have been calculated, the smallest cost amongst them is found by the P2 MIN function and the final motion vector is calculated. For the Sleipnir blocks that do not use the Motion Compensation (MC) function, the DONE/RESTART state is reached; if MC is used, it will be calculated before the block finishes.
The blocks calculating on one macroblock reach the DONE stage and finalize their execution. The RESTART function is naturally only used by the blocks calculating on more than one macroblock per execution. If the DONE/RESTART function is reached, the program starts over from START/RESTART if there are more macroblocks left to compute; otherwise it reaches DONE and finalizes its execution. The “P2” in some function names indicates that they are used in phase two of the search algorithm, when the small hexagon pattern discussed in section 3.4.1 is used.
[Figure 6.1: Motion estimation program flowchart. From START/RESTART the 7 large-hexagon search points (MID, LEFT, RIGHT, UP LEFT, UP RIGHT, DOWN LEFT, DOWN RIGHT) are evaluated; the MIN functions drive the main loop, MIN MID enters phase 2 with the small-hexagon points (P2 UP, P2 DOWN, P2 LEFT, P2 RIGHT), and P2 MIN leads via MC to DONE/RESTART.]
The MC part in the final stage of figure 6.1 is, as mentioned, only included in the final block, but all other stages are common to all blocks. The computations performed by the different functions in figure 6.1 are depicted in figure 6.2.
The Finite State Machine (FSM) included in figure 6.2 is the state machine
presented in figure 6.1. Here “SAD calculating functions” are e.g. LEFT, RIGHT,
UP LEFT and “Min functions” are e.g. MIN, P2 MIN, MIN LEFT, MIN UP
RIGHT.
[Figure 6.2: Motion estimation computational flowchart. The FSM of figure 6.1 drives the SAD calculating functions (out-of-bounds check, odd/even flag check, data address calculation, ODD/EVEN computation, sum and store), the min functions (find minimum value, update odd/even flag, update base address) and the MC calculating functions (odd/even flag check, address calculation, EVEN MC or ODD MC, store, done or next MB). At START/RESTART the loop counter, motion vector and address registers are initialized and the data macroblock is copied to the other LVM.]
When the block starts, it first sets the loop counter to zero, sets the motion vector to (15, 15) (to start in the middle) and initializes the addresses used to access the reference macroblocks stored in the Local Vector Memory (LVM). Then the current data macroblock is copied from the input LVM to the other LVM, to make the calculations easier by accessing one memory for the reference and the other for the data. By comparing the motion vector that corresponds to the current search point's position with the minimum and maximum values allowed, 0 and 30 respectively, search points that are out of bounds can be detected. If the search point is found to be out of bounds, the SAD of that macroblock will not be calculated and the program will continue to the next search point.
In figure 6.2, ODD and EVEN are the computational functions which use the new instructions HVBSUMABSDWA and HVBSUMABSDNA to calculate the Sum of Absolute Differences (SAD) of the macroblocks. After that, the results are summed up to a single integer value and stored in memory at one of the 11 (7 + 4) addresses dedicated to the search points in the large and small hexagon search patterns. The MIN and P2 MIN functions can then find the smallest value of the costs previously stored at a subset of the specific addresses mentioned above. MIN, for instance, examines the first 7 costs from the large hexagon pattern, and P2 MIN the final 4 search points plus the middle point again. Once the minimum value is found, it will be known whether the ODD/EVEN flag has to be updated or not. If the search point moves an even number of pixels the flag is unchanged; if it moves an odd number of pixels the flag is inverted to indicate the change.
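The flag update can be sketched as follows, under the assumption that it is the horizontal displacement that affects the byte alignment in memory:

```python
def update_flag(flag, dx, dy):
    """Invert the alignment flag on an odd horizontal move.

    flag selects between the word-aligned (HVBSUMABSDWA) and not-word-aligned
    (HVBSUMABSDNA) instruction; dy is accepted for completeness, but the
    assumption here is that a vertical move does not change byte alignment.
    """
    if dx % 2 != 0:
        flag = not flag
    return flag
```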
The MC calculation is very similar to the SAD calculation; the differences are that it cannot be out of bounds and that the result is one complete macroblock of the residue, which consists of 32 vectors of 16-bit integers. Once the MC calculation is finished, the execution will either finish or restart calculations on the next macroblock.
Simple Flow Control
The simple program flow controller was implemented with a series of conditional
jump instructions and a status flag stored in memory to move between functions
in a correct order as depicted in figure 6.3.
[Figure 6.3: Hexagon search program flow controller. A chain of conditional jumps over the status flag, with coarse forward jumps at > 4, > 7, > 11 and > 14: START, MID, the search point functions LEFT (1) to DOWN RIGHT (6), MIN (7), the MIN functions MIN MID (8) to MIN DOWN RIGHT (14), the phase-2 functions P2 UP (15) to P2 RIGHT (18), P2 MIN (19) and DONE.]
The status flag is updated by each function to enable execution of the next function in order. To enable the ability to only recalculate the three necessary positions, other flags are set by the corresponding MIN functions for each position, indicating whether it should be recalculated in the next pass or not. The block was verified to work as intended by comparing the produced results with the results produced by the Python motion estimation reference program described in section 6.1.1.
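The flow controller can be sketched as a lookup keyed on the status flag; the grouping below mirrors the coarse jump-forward tests in figure 6.3, and the Python dictionaries stand in for the chain of conditional jumps:

```python
SEARCH = {1: "LEFT", 2: "RIGHT", 3: "UP LEFT", 4: "UP RIGHT",
          5: "DOWN LEFT", 6: "DOWN RIGHT", 7: "MIN"}
MIN_FUNCS = {8: "MIN MID", 9: "MIN LEFT", 10: "MIN RIGHT",
             11: "MIN UP LEFT", 12: "MIN UP RIGHT",
             13: "MIN DOWN LEFT", 14: "MIN DOWN RIGHT"}
PHASE2 = {15: "P2 UP", 16: "P2 DOWN", 17: "P2 LEFT",
          18: "P2 RIGHT", 19: "P2 MIN"}

def dispatch(status):
    """Select the next function from the status flag; the group test models
    the coarse 'jump forward' comparisons that skip whole groups."""
    for group in (SEARCH, MIN_FUNCS, PHASE2):
        if status in group:
            return group[status]
    raise ValueError("unknown status flag")
```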
Advanced Flow Control
When the block using the simple program flow control was functioning, it became clearer that implementing functionality for call and return in the slaves could yield an increase in performance. The result was the implementation of a relatively simple hardware stack with only 4 levels, as shown in figure 6.4. In figure 6.4 the original program flow control consists of the blocks inside the dashed box. The additional hardware added by call and return is the Call / Return Controller block and the 4 address registers it uses. The added hardware cost of these parts is very reasonable, and the increase in program flow controllability and the performance gain that follows make this a worthwhile addition.
[Figure 6.4: Proposed implementation of call and return hardware. The Call / Return Controller and its four return address registers are added alongside the original program flow control: PC, PC FSM, PM, pipeline register and instruction decoder.]
Once functionality for call and return instructions had been added to the simulator, a new hexagon search block that utilizes these new features was written. With the call and return functionality, the somewhat primitive program flow controller could be replaced with function calls. The program flowchart in figure 6.1 is still valid for blocks using the advanced program flow control, because the functionality of the blocks has not changed.
Multiple Macroblocks
Once the call and return version of the block was completed and confirmed as working, further development was based on making each Sleipnir core perform calculations on several macroblocks. The numbers of macroblocks to calculate motion vectors for during one Sleipnir block execution were chosen as 5 and 13, because they both divide 65 evenly. Two block versions, one for 5 and one for 13 macroblocks, were implemented and tested. One of the most substantial benefits of doing multiple macroblock calculations at a time is the opportunity to exploit data reuse.
[Figure 6.5: Reference macroblock overlap. The shaded areas mark the vertical overlap between the search areas of adjacent data macroblocks, which a single 3 × Height fetch avoids.]
For each extra macroblock beyond the first, an amount of data transfer equal to 6 macroblocks (6 ∗ 16 = 96 vectors) can be saved. The reason for this is that a contiguous 3 × Height search area avoids the vertical overlap, the shaded areas, depicted in figure 6.5.
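The saving can be expressed as arithmetic, with one macroblock (16 rows of 16 bytes) occupying 16 vectors of 128 bits:

```python
VECTORS_PER_MB = 16   # one 16x16 macroblock at 8 bits per pixel

def vectors_saved(n):
    """Transfer saved by sharing one 3 x (n+2) search area among n vertically
    adjacent macroblocks, versus n separate 3 x 3 search areas."""
    separate = n * 9            # macroblocks fetched as n separate 3x3 areas
    shared = 3 * (n + 2)        # macroblocks in the shared column
    return (separate - shared) * VECTORS_PER_MB
```

Each extra macroblock beyond the first thus saves 6 macroblocks, i.e. 96 vectors, of transfer.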
The Sleipnir block calculating 13 motion vectors during each execution needs a data input equal to the 13 data macroblocks, but also the search area for them, which constitutes a 3 × 15 area of macroblocks. There is still a considerable horizontal overlap in this setup, but the advantage over calculating one macroblock per execution, transferring each data macroblock and its 3 × 3 macroblocks of search area, is considerable.
[Figure 6.6: Reference macroblock partitioning for 13 data macroblocks. The upper part of the reference frame in main memory is shown divided into numbered reference columns, alternating between Sleipnir 0 and Sleipnir 1, with the horizontal data overlap between adjacent columns marked.]
Figure 6.6 shows the data partitioning of a frame in the main memory, where only the upper right corner of the full frame is shown. The numbered areas illustrate the overlay of the data macroblocks being calculated by each Sleipnir block execution. The data macroblocks are taken from the current frame, not the reference frame shown in figure 6.6. As the frame contains 118 columns to be calculated, the next row of 13 macroblocks starts as number 119.
Motion Compensation
Once a motion estimation block has found the best match, only a little extra time is needed to calculate the motion compensated residue of that macroblock. By adding this functionality to the motion estimation block, motion compensation can be achieved for a very low overhead cost, as all information needed is already present. To perform motion compensation, the block, once it has found its best match, has to perform a subtraction between two macroblocks. This adds up to 256 subtraction operations, or 32 vector subtraction instructions in the Sleipnir. The result will be stored as 32 vectors of 8 16-bit integers and copied back to the main memory along with the motion vectors.
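The residue computation itself is a plain element-wise subtraction, sketched here in Python:

```python
def residue(current, reference):
    """Element-wise difference of two 16x16 macroblocks given as flat lists
    of 256 pixel values (256 subtractions, i.e. 32 vector subtraction
    instructions on the Sleipnir); results are kept as signed values."""
    assert len(current) == len(reference) == 256
    return [c - r for c, r in zip(current, reference)]

res = residue([10] * 256, [12] * 256)
```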
As mentioned in section 6.1.3 the motion compensation block uses the HVBSUBWA
and HVBSUBNA instructions to speed up the calculation of the residue macroblocks.
6.1.4 Master Code
The implementation of the master code was started once the Sleipnir code was found to be working. The master's tasks include keeping track of how many more macroblocks to perform calculations on, setting up all DMA data transfers to and from the Sleipnir cores and dividing the workload of the motion estimation between them.
In figure 6.7 the program flow of the master is shown. In the prolog, the stack pointer, DMA and slave interrupts, the number of macroblocks to compute, address registers and configurations of data storage in the main memory are set up. In the prolog the master also loads the program and constants into the Sleipnir cores' memories.
[Figure 6.7: Master program flowchart. After the Prolog, a loop over Configure DMA, Start DMA, Start Sleipnir, Find Next Available Sleipnir and Copy Results runs for as long as there are more macroblocks to compute, followed by the Epilog.]
In the Configure DMA stage, the coming DMA transfers for data and reference are configured. The addresses to which the result should be written are saved to DM0 at the label called Results, which can be found in figure 6.8a.
In the Start DMA stage, the DMA transfers are started and the addresses of the location of the next data block are calculated. Two DMA transfers are performed, so a wait for the first one to finish is needed; the calculation of the next address is hidden during this wait. These addresses are saved to DM0 in the RAM blocks DMA data and DMA ref for easier configuration of the DMA when the Configure DMA step is reached the next time.
In the Start Sleipnir stage, the memory switches are set correctly before the Sleipnir core is started. After the Sleipnir core has been started, it is time to find a new available Sleipnir core to fill with data and start.
The Find Next Available Sleipnir stage iterates over the Sleipnirs until it finds one that is free. This iteration gives the ability to rather quickly find any Sleipnir that has finished its execution. Sleipnir cores are chosen in a first-free-first-served fashion, so that Sleipnir 0 has the highest priority and Sleipnir 7 the lowest. The program uses status flags for each core to know what it is currently doing. These flags are used when finding a new free core.
When a running Sleipnir block finishes, it sends an interrupt to the master, which changes the status flag for the core in an interrupt routine. The next time the master is looking for a free Sleipnir, it will find, due to the flag value, that the Sleipnir has finished and needs to have its results copied back.
When a Sleipnir core has finished execution, the results from that core need
to be copied back to the main memory. This is done in the Copy Results stage.
The information about where the results should be copied is fetched from DM0 and
written to the Copy Back (CB) allocated memory DMA CB, seen in figure 6.8a.
The DMA unit is then configured with the information found in DMA CB. When
the DMA transfer is finished it is time to load the Sleipnir with new data to
perform calculations on, which happens in the Configure DMA and Start DMA
stages.
This program structure enables out-of-order completion among the Sleipnir
cores, which suits a search algorithm such as motion estimation that has a highly
variable execution time.
If there are more macroblocks available for calculation, the loop continues
to the Configure DMA stage again. If all macroblocks are finished, the Epilog is
activated.
In the Epilog the master waits for all Sleipnirs to finish their execution. The
last results are then copied to main memory and the kernel is finalized.
Figure 6.8: Memory allocation of data memory in the master (a: DMA Data, DMA Ref, DMA CB and Results in RAM 0 of DM0) and main memory allocation (b: Data (current frame), Reference Frame, Results, PM and CM)
The memory allocation of the master's data memory can be found in figure 6.8a.
It is a simple setup where three blocks of DMA configuration settings are stored
at the top of RAM 0. During runtime the master points the DMA firmware to
the different memory blocks where it reads the DMA settings. The last block in
RAM 0 is used to store result pointers for main memory. These result pointers
are needed because the blocks are likely to finish in an out-of-order fashion.
The allocation overview of the main memory can be seen in figure 6.8b. Data
and Ref hold the data for the frames to encode. The Results block is memory
allocated for the residues and for the motion vectors. The last two blocks are
memory allocated for the Sleipnir block program and its constants.
Figure 6.9: Sleipnir core motion estimation task partitioning and synchronization (tasks are dispatched to the Sleipnir cores over time and their results complete out of order in main memory)
In figure 6.9 the motion estimation task partitioning and synchronization between Sleipnir cores is shown. Task 0, Task 1 and so on contain the copying of
both the data macroblocks and the reference macroblocks used for the search area,
and the motion estimation performed on them. The number of data macroblocks
in a task can be 1, 5 or 13 and the number of reference macroblocks 9, 21 or 45
respectively. The figure also shows examples of the Sleipnir cores' execution times
and the resulting out-of-order completion of the tasks.
6.2 Discrete Cosine Transform and Quantization
The output from the motion compensation block will be a motion compensated
residue which will be the input to the next part of the encoder, the Discrete Cosine
Transform (DCT) and Quantization block.
6.2.1 Forward DCT and Quantization
The DCT and the quantization were combined into one Sleipnir block to save
cycles by performing the quantization directly after the transform. The
Quantization Parameter (QP) was chosen as a fixed value of 10; this value is easily
changed if another fixed value is desired. Adding support for a variable QP
would cost both additional instructions and additional constants in the constants
memory. To keep execution times as low as possible while still following the
H.264 standard, a variable QP was left out. The order of computations in the
DCT and quantization block is as follows:
1. Process the blocks through the first DCT stage.
2. Transpose the blocks.
3. Process the blocks through the second DCT stage.
4. Transpose the blocks again.
5. Multiply by MF, scale by qbits and round to get the result.
The calculation of a 4×4 block based two-dimensional DCT as discussed in section
3.5.1 can be described as follows:
1. Calculate X0 .. X3 according to figure 3.9 for each row of the block.
2. Transpose the resulting 4 × 4 block to be able to calculate the DCT of the
columns.
3. Calculate X0 .. X3 according to figure 3.9 for each column of the block.
4. Transpose the resulting 4 × 4 block again to get the final result.
As the block is transposed two times the resulting block will not be transposed
compared to the input block.
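The four steps above can be sketched as a plain scalar model (not the vectorized Sleipnir code). Figure 3.9 is not reproduced here; its butterfly is assumed to be equivalent to multiplying by the standard H.264 forward core transform matrix Cf:

```python
# Scalar model of the 4x4 two-dimensional forward core transform.
# CF is the standard H.264 forward transform matrix; the butterfly of
# figure 3.9 computes the same 1-D result.
CF = [[1,  1,  1,  1],
      [2,  1, -1, -2],
      [1, -1, -1,  1],
      [1, -2,  2, -1]]

def transpose(block):
    return [list(row) for row in zip(*block)]

def dct1d_rows(block):
    # One 1-D stage applied to every row: out[i][k] = sum_j CF[k][j]*X[i][j],
    # i.e. each output row is CF applied to the corresponding input row.
    return [[sum(CF[k][j] * row[j] for j in range(4)) for k in range(4)]
            for row in block]

def dct2d(block):
    # Stage 1 on the rows, transpose, stage 2 (now on the columns),
    # transpose back; the composition yields CF * X * CF^T untransposed.
    return transpose(dct1d_rows(transpose(dct1d_rows(block))))
```

For a constant block the energy collects in the DC position, which is a quick sanity check of the two-stage structure.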
The input data is presumed to be stored as 16-bit integers as this is the native
Sleipnir datapath width. The input data itself consists of a number of 4 × 4 blocks
of the residue pixel values. To utilize the full datapath width of the Sleipnir two
4 × 4 blocks can be calculated simultaneously. The flow of the two-dimensional
DCT is depicted in figure 6.10. First the input data consisting of two 4 × 4 pixel
blocks is read in and transformed through the first DCT stage. The result will
be two one-dimensionally DCT-transformed 4 × 4 pixel blocks. The following
transpose of the blocks will be performed as shown in figure 6.11. After that the
transposed blocks will be processed by the second DCT stage and finally the blocks
are yet again transposed to complete the two-dimensional DCT transform.
Figure 6.10: DCT flowchart (input data of two 4 × 4 blocks of 16-bit integers, first DCT, first blockwise transpose, second DCT, second blockwise transpose, final two-dimensional DCT output)
Figure 6.11: Memory transpose schematic (two 4 × 4 blocks to be transposed, constant memory permutation addressing, memory mapping and transposed output)
The blocks to be transposed will be stored in memory according to the Memory
Mapping part of figure 6.11 where the data is displaced one address higher for
each new vector stored. The displacement is necessary as only one value can be
read out from each memory bank. In the Local Vector Memories (LVMs) there
are 8 memory banks, one for each column of the memory. This setup enables
the addressing of the memory according to prestored addresses in the Constant
Memory (CM). As the arrows in the figure show, the first address vector in CM
is 0, 9, 18, 27, 4, 13, 22 and 31. This vector fetches the values of pixels 1, 5, 9,
13, 17, 21, 25 and 29, which can then be stored in e.g. a vector register. By using
the memory transpose as shown in figure 6.11 the transpose can be performed in
only 4 vector copy instructions. An excerpt from the first transpose of the Sleipnir
block code is
vcopy vr0 m1[ar1 + cm[ACCESS_PATTERN_0_4]].vw
vcopy vr3 m1[ar1 + cm[ACCESS_PATTERN_3_7]].vw
vcopy vr1 m1[ar1 + cm[ACCESS_PATTERN_1_5]].vw
vcopy vr2 m1[ar1 + cm[ACCESS_PATTERN_2_6]].vw
where ar1 is an address register pointing to the location of data stored in memory
m1 (Memory Mapping), vr0 to vr3 are vector registers and the access patterns are
ACCESS_PATTERN_0_4: 0  9 18 27 4 13 22 31
ACCESS_PATTERN_1_5: 1 10 19 28 5 14 23 32
ACCESS_PATTERN_2_6: 2 11 20 29 6 15 24 33
ACCESS_PATTERN_3_7: 3 12 21 30 7 16 25 34
as also shown in figure 6.11. The particular order of the vector registers in the
excerpt comes from the minimization of data dependency in the following stage of
the Sleipnir block code. Calculating the transpose of the two 4 × 4 blocks in only
4 instructions contributes to a fast DCT.
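The skewed storage and permutation addressing can be modelled in a few lines of Python (illustrative, not ePUMA code). The key property is that every address within a pattern falls in a distinct bank (address mod 8), so all eight values can be read in a single access:

```python
BANKS = 8        # the LVM has 8 memory banks; bank = address mod 8
ROW_STRIDE = 9   # each stored vector is displaced one address higher

def store_skewed(rows):
    """Store 8-element row vectors with a one-address skew per row."""
    mem = {}
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            mem[i * ROW_STRIDE + j] = value
    return mem

def gather(mem, pattern):
    """One vcopy with a CM access pattern: read 8 addresses at once,
    which is only legal if they all hit different banks."""
    banks = [addr % BANKS for addr in pattern]
    assert len(set(banks)) == BANKS, "bank conflict"
    return [mem[addr] for addr in pattern]

# The four access patterns from the excerpt above, generated: column c
# of the left 4x4 block, then column c of the right 4x4 block.
PATTERNS = [[c + ROW_STRIDE * r for r in range(4)] +
            [4 + c + ROW_STRIDE * r for r in range(4)] for c in range(4)]
```

Gathering with `PATTERNS[0]` on the memory layout of figure 6.11 returns pixels 1, 5, 9, 13, 17, 21, 25, 29, exactly the first transposed output vector.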
Once the DCT is completed, the final stage of the block performs the quantization. The final quantization formula from section 3.5.3 has the benefit of being
easy to implement in integer arithmetic, as the division can be replaced by a shift
operation and MF only consists of integer numbers. The division by 2^qbits can
be rewritten as an arithmetic right shift by qbits. Utilizing this, the implemented
quantization expression can be written as

Zij = round((Wij · MFij) >> qbits)    (6.1)

where >> is the right shift operation. [6]
The quantization was implemented by multiplying the 4 × 4 blocks by the
Multiplication Factor (MF) described in equation (3.19) and table 3.2 in section
3.5.3. This results in 4 vector to vector multiplications between the blocks and
the MF used for the current value of QP. The shift by qbits and rounding was
implemented using the scaling and rounding of the multiplication result, which is
a built-in function of the multiplication instruction. An excerpt from the quantization part
of the Sleipnir block code is
vvmul<rnd, scale=16, ss> m0[ar2+=8].vw vr0 cm[MF_QP_10_1]
vvmul<rnd, scale=16, ss> m0[ar2+=8].vw vr1 cm[MF_QP_10_2]
vvmul<rnd, scale=16, ss> m0[ar2+=8].vw vr2 cm[MF_QP_10_1]
vvmul<rnd, scale=16, ss> m0[ar2+=8].vw vr3 cm[MF_QP_10_2]
where the values to be quantized are stored in the vector registers vr0 to vr3, ar2
is the address register pointing to the location in memory m0 where data will be
stored and the Multiplication Factors (MF) for QP equal to 10 are
MF_QP_10_1: 8192 5243 8192 5243 8192 5243 8192 5243
MF_QP_10_2: 5243 3355 5243 3355 5243 3355 5243 3355
which are derived from table 3.2 and equation (3.19) in section 3.5.3. The ability
to quantize two 4 × 4 blocks in 4 instructions gives a quick quantization.
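As a plain-integer model of equation (6.1) for QP = 10, qbits = 15 + floor(QP/6) = 16, which matches scale=16 in the excerpt above. The round-half-up on magnitudes below is an assumption about the multiplier's rounding mode, not taken from the Sleipnir documentation:

```python
QBITS = 16  # 15 + floor(10 / 6) for QP = 10

# MF rows for QP = 10 as stored in constant memory; two 4x4 blocks sit
# side by side, so each 4-wide pattern is repeated twice per vector.
MF_QP_10_1 = [8192, 5243, 8192, 5243] * 2   # rows 0 and 2
MF_QP_10_2 = [5243, 3355, 5243, 3355] * 2   # rows 1 and 3

def quantize_row(w_row, mf_row):
    """Z = round((W * MF) >> qbits), rounding half up on magnitudes
    (assumed rounding mode)."""
    out = []
    for w, mf in zip(w_row, mf_row):
        sign = -1 if w < 0 else 1
        out.append(sign * ((abs(w) * mf + (1 << (QBITS - 1))) >> QBITS))
    return out
```

For example, a DC coefficient of 16 multiplied by 8192 and shifted by 16 quantizes to 2.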
6.2.2 Rescaling and Inverse DCT
The rescaling and the Inverse DCT (IDCT) were also combined into one Sleipnir
block to be able to save cycles by performing the IDCT directly after the rescaling.
As with the DCT and quantization block only a fixed value of rescaling is supported
to speed up execution while still following the H.264 standard.
The order of computations in the IDCT and rescaling block is as follows:
1. Perform rescaling by multiplying the blocks by V′.
2. Run the blocks through the first IDCT stage.
3. Transpose the blocks.
4. Run the blocks through the second IDCT stage.
5. Divide by 64 and round to get the result.
The calculation of the 4×4 block based two-dimensional IDCT can be described
as follows:
1. Calculate x0 .. x3 according to figure 3.10 for each row of the block.
2. Transpose the resulting 4 × 4 block to be able to calculate the IDCT of the
columns.
3. Calculate x0 .. x3 according to figure 3.10 for each column of the block.
4. Transpose the resulting 4 × 4 block again to get the final result.
As the block is transposed two times the resulting block will not be transposed
compared to the input block. To utilize the full datapath width of the Sleipnirs,
two 4 × 4 blocks can be calculated simultaneously.
The first stage of the block performs the rescaling by multiplying the 4 × 4
blocks by the rescaling factors (V) which was described in equation (3.25) and
table 3.3 in section 3.5.4.
The final rescaling formula discussed in section 3.5.4 was

W′ij = Zij · Vij · 2^floor(QP/6)    (6.2)
which like the final quantization formula has the benefit of being easy to implement
in integer arithmetic. [6]
The factor 2^floor(QP/6) causes the output to increase by a factor of two for every
increment of 6 in QP. This factor can be incorporated into V, saving at least the
calculation of floor(QP/6) and one multiplication at the cost of storing more
constants in memory. As the constant memory is only read into the Sleipnir core
once for each change of block, this was found to be beneficial, especially if a
Sleipnir core will be dedicated to running the IDCT and Rescaling block. By
incorporating the multiplication by 2^floor(QP/6) into V, (6.2) can be rewritten as

W′ij = Zij · V′ij    (6.3)

where V′ij is Vij with a built-in scaling of 2 for every increase of 6 in QP. Note that
the result from the following Inverse DCT has to be rescaled once more to remove
the constant scaling factor of 64 introduced in (3.24), which was also incorporated
in V. This is the formula used in the implementation of the rescaling part of the
block.
An excerpt from the rescaling part of the Sleipnir block code is
vvmul<rnd, scale=0, ss> vr1 m0[ar0 + 8].vw  cm[V_QP_10_2]
vvmul<rnd, scale=0, ss> vr3 m0[ar0 + 24].vw cm[V_QP_10_2]
vvmul<rnd, scale=0, ss> vr0 m0[ar0].vw      cm[V_QP_10_1]
vvmul<rnd, scale=0, ss> vr2 m0[ar0 + 16].vw cm[V_QP_10_1]
where ar0 is the address register pointing to the location of the blocks in memory
m0, the vector registers vr0 to vr3 will store the rescaled result and the rescaling
factors (V) for QP equal to 10 are
V_QP_10_1 : 32 40 32 40 32 40 32 40
V_QP_10_2 : 40 50 40 50 40 50 40 50
which are derived from table 3.3 and equation (3.25) in section 3.5.4. The ability
to rescale the two 4 × 4 blocks in 4 instructions gives a quick rescaling.
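The rescaling of (6.3) and the final shift applied after the IDCT can be modelled in plain integers as below; V′ already contains both 2^floor(QP/6) and the constant factor of 64, which the final shift removes. The round-half-up handling is an assumption, as before:

```python
# V' rows for QP = 10 as stored in constant memory (two 4x4 blocks
# side by side, 4-wide pattern repeated twice).
V_QP_10_1 = [32, 40, 32, 40] * 2   # rows 0 and 2
V_QP_10_2 = [40, 50, 40, 50] * 2   # rows 1 and 3

def rescale_row(z_row, v_row):
    """W' = Z * V' (scale = 0 in the excerpt: plain multiplication)."""
    return [z * v for z, v in zip(z_row, v_row)]

def final_shift(x):
    """Divide the IDCT output by 64 with rounding (scale = 6),
    rounding half up on magnitudes (assumed rounding mode)."""
    sign = -1 if x < 0 else 1
    return sign * ((abs(x) + 32) >> 6)
```

Chaining the QP = 10 examples: the quantized DC value 2 rescales to 2 · 32 = 64, and 64 survives the final shift as 1, showing how the factor of 64 built into V′ cancels.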
The IDCT is implemented much like the DCT described in section 6.2.1. Compared to the DCT, the transform stages are changed to those performing the IDCT
as described in section 3.5.2, but, for example, the transpose functionality is still
the same. In addition, the stages are executed in the reverse order of the DCT.
The IDCT is followed by an arithmetic shift right by 6
bits which can be written as
X = round(Xr >> 6)
(6.4)
where Xr is the output from the IDCT, >> is the right shift operation and X is
the final output. This final shift performs a division by 64 and removes the
constant scaling factor of 64 that was introduced via V′ij.
The final stage is done by a vector to vector multiplication using the built-in
scaling and rounding functionality of the multiplication instruction. An excerpt
from the final scaling part of the Sleipnir block code is
vvmul<rnd, scale=6, ss> m1[ar1 + 9].vw  vr0 cm[ONES]
vvmul<rnd, scale=6, ss> m1[ar1 + 18].vw vr2 cm[ONES]
vvmul<rnd, scale=6, ss> m1[ar1].vw      vr4 cm[ONES]
vvmul<rnd, scale=6, ss> m1[ar1 + 27].vw vr5 cm[ONES]
where ar1 is the address register pointing to the location in memory m1 where the
results should be stored, the vector registers vr0, vr2, vr4 and vr5 contain the Xr
values and ONES is a constant memory vector consisting of 8 ones,

ONES: 1 1 1 1 1 1 1 1

as only the scaling and rounding functionality of the multiplier is needed.
Chapter 7
Results and Analysis
In this chapter the performance results from the implementations of the kernels
and blocks are presented. The results presented are for the motion estimation,
motion compensation and transform and quantization.
7.1 Motion Estimation
In this section the results from different simulations of motion estimation are presented. The results depend on different properties of the kernel code, and each
subsection has a separate description of how the simulation was performed. A total
of 5 kernels and 4 video sequences have been tested on 1, 2, 4 and 8 Sleipnir cores.
The results are based on calculations of 7 670 macroblocks, which means that the
edges of the frame have intentionally been left out. The edges consist of 430
macroblocks. This simplification was made because these macroblocks would add
special cases, which could have been solved with for example a message box
to each Sleipnir. Message boxes were not available in the revision of the simulator
that was used. All test sequences were downloaded from [1].
The simulations are all executed with revision 9888 of the ePUMA simulator
with a patch on event.hpp from revision 9958. The patch corrects event IDs for
DMA and Sleipnir cores.
In table 7.1 short names of the kernels under test are presented with a short
description. These names will be used throughout this section. In table 7.2 the
columns of the result tables are described.
Data_sent = Searches ∗ (MBs_reference + MBs_data) ∗ vectors_per_MB    (7.1)
The amount of data sent to the blocks is in all cases calculated according to
equation (7.1). Searches is the total number of searches performed and MBs_reference
is the number of macroblocks sent to the Sleipnir blocks as reference to be used
as the search area. MBs_data is the number of data macroblocks and vectors_per_MB is
16 when using a representation of 8 bits per pixel and 32 when using 16 bits per
pixel. In equation (7.1) the data transfer cost of the DMA for programming the
Sleipnirs’ PM and CM is not included.
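The data volumes quoted throughout this section follow directly from equation (7.1). As a small sketch, assuming 128-bit (16-byte) vectors:

```python
VECTOR_BYTES = 16  # 128-bit vectors (assumed vector width)

def data_sent_vectors(searches, mbs_reference, mbs_data, vectors_per_mb=16):
    """Equation (7.1): total number of vectors DMA'd to the blocks.
    vectors_per_mb is 16 for 8-bit pixels and 32 for 16-bit pixels."""
    return searches * (mbs_reference + mbs_data) * vectors_per_mb

def to_mbyte(vectors):
    """Convert a vector count to MByte (2^20 bytes)."""
    return vectors * VECTOR_BYTES / 2**20
```

For kernel 1, `data_sent_vectors(7670, 9, 1)` gives 1 227 200 vectors, matching the 18.73 MByte quoted in section 7.1.1; kernels 3 and 4 plug in their own search counts and macroblock counts the same way.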
Kernel 1: Calculates the motion vector for one macroblock each execution. Program flow control is implemented with jump.
Kernel 2: Calculates the motion vector for one macroblock each execution. Program flow control has support for call and return.
Kernel 3: Calculates motion vectors for 5 macroblocks each execution. Program flow control has support for call and return.
Kernel 4: Calculates motion vectors for 13 macroblocks each execution. Program flow control has support for call and return.
Kernel 5: Calculates motion vectors and motion compensated residue blocks for 13 macroblocks. Program flow control has support for call and return.

Table 7.1: Short names for kernels that have been tested
Core: Sleipnir core
Number of starts: Number of times the Sleipnir core has been started with the specific block
Total cycles: Total number of cycles that the Sleipnir has executed during simulation
Idling cycles: Total number of cycles that the Sleipnir has been idling during simulation
Runtime idle: Number of cycles the Sleipnir has been idling, not including idle before the first start and after the last start
Utilization in percent: Sleipnir utilization in % based on the total number of cycles executed in the block and the total simulated cycles

Table 7.2: Description of table columns
7.1.1 Kernel 1
Results presented in this section are simulations of the Sleipnir block performing
the motion vector calculation of one macroblock; this block is called block 1.
The Sum of Absolute Difference (SAD) calculations are implemented using the
complex instructions HVBSUMABSDWA and HVBSUMABSDNA as discussed in section
6.1.2. Program flow control is implemented using the jump instruction. Only
simulations that required the most computational power are presented.
Result

Core        Number of starts  Total cycles  Idling cycles  Runtime idle  Utilization in percent
Sleipnir 0  7 670             30 853 670    5 574 019      5 572 399     84.7
Avg. util.                                                               84.7
Master                        36 427 689

Table 7.3: Motion estimation results from simulation on Riverbed frame 10 and
Riverbed frame 11 with kernel 1 using 1 Sleipnir core
Core        Number of starts  Total cycles  Idling cycles  Runtime idle  Utilization in percent
Sleipnir 0  1 220             4 806 328     1 363 006      1 358 873     77.9
Sleipnir 1  1 189             4 693 008     1 476 326      1 318 404     76.1
Sleipnir 2  1 141             4 546 033     1 623 301      1 257 604     73.7
Sleipnir 3  1 086             4 330 691     1 838 643      1 197 861     70.2
Sleipnir 4    993             4 072 588     2 096 746      1 104 705     66.0
Sleipnir 5    885             3 591 793     2 577 541        984 926     58.2
Sleipnir 6    698             2 902 438     3 266 896        786 886     47.0
Sleipnir 7    458             1 910 784     4 258 550        509 376     31.0
Avg. util.                                                               62.5
Master                        6 169 334

Table 7.4: Motion estimation results from simulation on Riverbed frame 10 and
Riverbed frame 11 with kernel 1 using 8 Sleipnir cores
Block 1  Cost
PM       613 instructions
CM       65 vectors
LVM 0    26 vectors
LVM 1    180 vectors

Table 7.5: Block 1 costs
The best runtime for one block execution was 1 584 cycles and the worst was 10 386
cycles in both simulations. The amount of data that was sent to the blocks was
7 670 ∗ ((9 + 1) ∗ 16) = 1 227 200 vectors (18.73 MByte), calculated according to
equation (7.1). 7 670 vectors (0.12 MByte) were copied back to main memory from
the blocks. Before any calculations can begin, a prolog is executed to copy vectors
to a second memory and to set up address registers. This prolog is 31 cycles in
block 1. After the search has finished an epilog is executed, which takes 8 cycles.
Analysis
Kernel 1 was the first working kernel and a proof of concept. The DMA configurations performed by the master are written in such a way that Sleipnir 0 has the
highest priority, Sleipnir 1 second priority and so on. This is why Sleipnir 7 has
a lower utilization compared to, for example, Sleipnir 0. With this kernel it was
found that a lot of cycles in the Sleipnir block were spent on state handling and
on extra overhead caused by the jump instruction, which can only jump to
immediate addresses.
7.1.2 Kernel 2
Results presented in this section are simulations of kernel 2. This kernel uses
an improved version of block 1, called block 2. Block 2 calculates the motion
vector of one macroblock each execution. The Sum of Absolute Difference (SAD)
calculations are implemented using the complex instructions HVBSUMABSDWA and
HVBSUMABSDNA as discussed in section 6.1.2. In block 2, hardware support for call
and return has been added to the simulator and is utilized for program flow
control.
Result

Core        Number of starts  Total cycles  Idling cycles  Runtime idle  Utilization in percent
Sleipnir 0  7 670             16 933 783    5 572 953      5 571 478     75.2
Avg. util.                                                               75.2
Master                        22 506 736

Table 7.6: Motion estimation results from simulation on Riverbed frame 10 and
Riverbed frame 11 with kernel 2 using 1 Sleipnir core
The results from the simulation with the riverbed video sequence are presented in
tables 7.6 and 7.7. The best runtime for one block 2 execution was 986 cycles and
the worst was 5 348 cycles in both simulations. The amount of data that was sent
to the blocks was 7 670 ∗ ((9 + 1) ∗ 16) = 1 227 200 vectors (18.73 MByte) and
7 670 vectors (0.12 MByte) were copied back as the result from the blocks. Before
any calculations can begin, a prolog is executed to copy vectors to a second memory
and to set up address registers. This prolog is the same as in block 1 and therefore
takes 31 cycles. After the search has finished an epilog is executed which, as in
block 1, takes 8 cycles.
Core        Number of starts  Total cycles  Idling cycles  Runtime idle  Utilization in percent
Sleipnir 0  1 755             3 848 831     2 003 542      2 000 872     65.8
Sleipnir 1  1 678             3 652 381     2 199 992      1 909 706     62.4
Sleipnir 2  1 562             3 393 931     2 458 442      1 770 069     58.0
Sleipnir 3  1 358             3 004 121     2 848 252      1 542 502     51.3
Sleipnir 4    910             2 069 282     3 783 091      1 035 091     35.4
Sleipnir 5    368               867 747     4 984 626        417 652     14.8
Sleipnir 6     38                95 305     5 757 068         41 958      1.6
Sleipnir 7      1                 2 178     5 850 195          1 181      0.0
Avg. util.                                                               36.2
Master                        5 852 373

Table 7.7: Motion estimation results from simulation on Riverbed frame 10 and
Riverbed frame 11 with kernel 2 using 8 Sleipnir cores
Block 2  Cost
PM       442 instructions
CM       65 vectors
LVM 0    26 vectors
LVM 1    180 vectors

Table 7.8: Block 2 costs
Analysis
In block 2 an improvement of 37.8% in best execution time can be seen compared
to block 1. There is also an improvement of 48.5% in the worst execution time
compared to block 1. This improvement is significant and should lower the total
execution time of one frame, but as can be seen the total execution time is only
improved by 5.1%. The explanation is that the average utilization of the Sleipnir
cores has decreased from 84.7% to 75.2% in the simulation with 1 Sleipnir and from
62.5% to 36.2% in the simulation with 8 Sleipnirs. In table 7.7 it can be seen that
the utilization of Sleipnir 7 is 0.0%. This indicates that the blocks are executing
too few cycles in the Sleipnirs or that the master is too slow and does not feed the
Sleipnirs with enough data. Targeting the master code does not offer many
opportunities for optimization, and the complexity of the code has not yet reached
the complexity of a complete encoder. It was therefore concluded that searching
more macroblocks per block should be investigated. Table 7.8 shows the
memory cost of block 2.
7.1.3 Kernel 3
Results presented in this section are simulations of kernel 3. This kernel uses a
Sleipnir block called block 3. Block 3 is a further development of block 2 where
a wrapper that handles looping has been added. Block 3 calculates the motion
vectors of 5 macroblocks during each execution. As in kernel 2 the Sum of Absolute
Difference (SAD) calculations are implemented using the complex instructions
HVBSUMABSDWA and HVBSUMABSDNA as discussed in section 6.1.2.
Result

Core        Number of starts  Total cycles  Idling cycles  Runtime idle  Utilization in percent
Sleipnir 0  198               2 226 674     273 956        264 792       89.0
Sleipnir 1  195               2 235 280     265 350        264 586       89.4
Sleipnir 2  193               2 218 086     282 544        256 840       88.7
Sleipnir 3  194               2 187 261     313 369        255 665       87.5
Sleipnir 4  191               2 152 389     348 241        253 583       86.1
Sleipnir 5  190               2 175 113     325 517        252 341       87.0
Sleipnir 6  187               2 135 931     364 699        250 903       85.4
Sleipnir 7  186               2 139 942     360 688        247 395       85.6
Avg. util.                                                               87.3
Master                        2 500 630

Table 7.9: Motion estimation results from simulation on Riverbed frame 10 and
Riverbed frame 11 with kernel 3 using 8 Sleipnir cores
Block 3  Cost
PM       478 instructions
CM       67 vectors
LVM 0    26 vectors
LVM 1    444 vectors

Table 7.10: Kernel 3 costs
The results from the simulation with the riverbed video sequence are presented in
table 7.9. The best runtime for one Sleipnir block execution was 5 866 cycles and
the worst was 19 416 cycles. The amount of data that was sent to the blocks was
(7 670/5) ∗ ((3 ∗ 7 + 5) ∗ 16) = 638 144 vectors (9.74 MByte) and 7 670 vectors
(0.12 MByte) were copied back as the result from the blocks. The prolog in block 3
is slightly larger than in block 2; it is now 46 cycles. The epilog has also increased
and now takes 83 cycles. Between the calculations on each macroblock there is an
intermission that takes 43 cycles to finish. This intermission changes offsets for
memory reads and copies a new macroblock to the second memory.
Analysis
Kernel 3 resulted in a 57.3% improvement of total simulation time when executing
on 8 Sleipnirs compared to kernel 2. The utilization has increased to over 85%
for Sleipnir 7, which is more acceptable. The wrapper introduced in block 3 only
required 36 extra instructions compared to block 2. The increase in LVM memory
needed is for storage of the 16 extra macroblocks, 4 more motion vectors and extra
overhead from e.g. the added loop counter.
7.1.4 Kernel 4
Results presented in this section are simulations of kernel 4. This kernel uses
a Sleipnir block called block 4. Block 4 is the next step of improvement of the
Sleipnir blocks and it calculates 13 motion vectors during each execution. As in
block 2 and 3 the Sum of Absolute Difference (SAD) calculations are implemented
using the complex instructions HVBSUMABSDWA and HVBSUMABSDNA as discussed in
section 6.1.2.
Result

Core        Number of starts  Total cycles  Idling cycles  Runtime idle  Utilization in percent
Sleipnir 0  149               4 370 096     262 074        253 825       94.3
Sleipnir 1  146               4 359 195     272 975        250 107       94.1
Sleipnir 2  148               4 377 214     254 956        249 644       94.5
Sleipnir 3  147               4 345 295     286 875        250 289       93.8
Avg. util.                                                               94.2
Master                        4 632 170

Table 7.11: Motion estimation results from simulation with Riverbed frame 10
and Riverbed frame 11 with kernel 4 using 4 Sleipnir cores
The results from the simulation with the riverbed video sequence are presented in
tables 7.12 and 7.11. The best runtime for one Sleipnir block execution was 18 057
cycles and the worst was 42 896 cycles in both simulations. The amount of data
that was sent to the blocks was (7 670/13) ∗ ((15 ∗ 3 + 13) ∗ 16) = 547 520 vectors
(8.35 MByte) and 7 670 vectors (0.12 MByte) were copied back as the result from
the blocks. Block 4 has the same prolog, intermission and epilog cycle cost as
block 3, i.e. 46, 43 and 83 cycles.
Core        Number of starts  Total cycles  Idling cycles  Runtime idle  Utilization in percent
Sleipnir 0  75                2 214 930     146 170        130 223       93.8
Sleipnir 1  74                2 200 402     160 698        129 794       93.2
Sleipnir 2  74                2 194 769     166 331        129 572       93.0
Sleipnir 3  75                2 191 467     169 633        130 113       92.8
Sleipnir 4  75                2 191 754     169 346        132 144       92.8
Sleipnir 5  73                2 174 953     186 147        127 130       92.1
Sleipnir 6  72                2 127 822     233 278        125 152       90.1
Sleipnir 7  72                2 155 699     205 401        127 116       91.3
Avg. util.                                                               92.4
Master                        2 361 100

Table 7.12: Motion estimation results from simulation on Riverbed frame 10 and
Riverbed frame 11 with kernel 4 using 8 Sleipnir cores
Block 4  Cost
PM       478 instructions
CM       67 vectors
LVM 0    26 vectors
LVM 1    964 vectors

Table 7.13: Kernel 4 costs
Analysis
Kernel 4 pushes the utilization up to over 90% in every Sleipnir. The total simulation time has decreased from 2.50 Mega cycles (Mc) to 2.36 Mc, which is an
improvement of 5.6%. Kernel 4 only copies 85.8% of the data compared
to kernel 3. This decrease in memory data transfers will help later when the whole
encoder is implemented. The cost of local memory used in the Sleipnir block has
increased by 520 vectors compared to block 3.
7.1.5 Kernel 5
Results presented in this section are simulations of kernel 5. This kernel uses a
Sleipnir block called block 5. Block 5 uses the same motion estimation code as
block 4 where as before the Sum of Absolute Difference (SAD) calculations are
implemented using the complex instructions HVBSUMABSDWA and HVBSUMABSDNA as
discussed in section 6.1.2. Added to block 5 is code for calculating the motion
compensated residue macroblock which is done using the HVBSUBWA and HVBSUBNA
instructions as discussed in section 6.1.2. The benefit of doing this in the same
Sleipnir block is that all extra overhead for moving data to another kernel is
avoided. In this part simulation results from 4 different video sequences are presented to highlight that there is a difference in total simulation time depending on
the data that is fed to the Sleipnir blocks.
Result

Core        Number of starts  Total cycles  Idling cycles  Runtime idle  Utilization in percent
Sleipnir 0  74                1 783 865     198 721        175 131       90.0
Sleipnir 1  75                1 806 053     176 533        178 070       91.1
Sleipnir 2  74                1 799 925     182 661        176 137       90.8
Sleipnir 3  73                1 766 940     215 646        173 459       89.1
Sleipnir 4  74                1 787 586     195 000        178 102       90.2
Sleipnir 5  74                1 798 046     184 540        178 102       90.7
Sleipnir 6  73                1 776 852     205 734        176 123       89.6
Sleipnir 7  73                1 765 463     217 123        173 901       89.0
Avg. util.                                                               90.1
Master                        1 982 586

Table 7.14: Motion estimation results from simulation on Sunflower frame 10 and
Sunflower frame 11 with kernel 5 using 8 Sleipnir cores
The results from the simulation with the sunflower video sequence are presented in
table 7.14. The best runtime for one Sleipnir block execution was 18 457 cycles and
the worst was 28 039 cycles. Table 7.15 shows the simulation on the blue sky video
sequence, which resulted in a best runtime for one Sleipnir block execution of
18 079 cycles and a worst runtime of 41 415 cycles.
Core        Number of starts  Total cycles  Idling cycles  Runtime idle  Utilization in percent
Sleipnir 0  74                1 945 498     195 944        170 596       90.8
Sleipnir 1  75                1 954 782     186 660        167 994       91.3
Sleipnir 2  74                1 929 502     211 940        170 267       90.1
Sleipnir 3  73                1 919 413     222 029        165 579       89.6
Sleipnir 4  73                1 926 889     214 553        164 997       90.0
Sleipnir 5  75                1 926 105     215 337        171 946       89.9
Sleipnir 6  74                1 926 972     214 470        172 619       90.0
Sleipnir 7  72                1 895 777     245 665        168 488       88.5
Avg. util.                                                               90.0
Master                        2 141 442

Table 7.15: Motion estimation results from simulation on Blue sky frame 10 and
Blue sky frame 11 with kernel 5 using 8 Sleipnir cores
The third simulation was done on the pedestrian area clip and the results can be
found in table 7.16. The best runtime for one Sleipnir block execution was 15 378
cycles and the worst was 47 611 cycles.
Core         Number of starts   Total cycles   Idling cycles   Runtime idle   Utilization in percent
Sleipnir 0   79                 2 089 641      193 948         184 881        91.5
Sleipnir 1   75                 2 071 619      211 970         171 855        90.7
Sleipnir 2   73                 2 071 989      211 600         168 548        90.7
Sleipnir 3   75                 2 049 072      234 517         177 366        89.7
Sleipnir 4   71                 2 080 353      203 236         164 579        91.1
Sleipnir 5   73                 2 058 428      225 161         171 021        90.1
Sleipnir 6   73                 2 043 657      239 932         171 602        89.5
Sleipnir 7   71                 2 031 873      251 716         164 102        89.0
Avg. util.                                                                    90.3
Master                          2 283 589

Table 7.16: Motion estimation results from simulation on Pedestrian area frame 10 and Pedestrian area frame 11 with kernel 5 using 8 Sleipnir cores
Core         Number of starts   Total cycles   Idling cycles   Runtime idle   Utilization in percent
Sleipnir 0   147                4 659 746      315 540         311 485        93.7
Sleipnir 1   147                4 628 572      346 714         310 410        93.0
Sleipnir 2   148                4 616 250      359 036         313 279        92.8
Sleipnir 3   148                4 629 588      345 698         315 330        93.1
Avg. util.                                                                    93.1
Master                          4 975 286

Table 7.17: Motion estimation results from simulation on Riverbed frame 10 and Riverbed frame 11 with kernel 5 using 4 Sleipnir cores
Core         Number of starts   Total cycles   Idling cycles   Runtime idle   Utilization in percent
Sleipnir 0   75                 2 333 516      207 336         188 358        91.8
Sleipnir 1   74                 2 349 169      191 683         187 756        92.5
Sleipnir 2   73                 2 331 312      209 540         185 101        91.8
Sleipnir 3   75                 2 323 308      217 544         189 070        91.4
Sleipnir 4   74                 2 331 564      209 288         188 828        91.8
Sleipnir 5   75                 2 305 196      235 656         186 202        90.7
Sleipnir 6   72                 2 299 853      240 999         184 476        90.5
Sleipnir 7   72                 2 260 234      280 618         180 283        89.0
Avg. util.                                                                    91.2
Master                          2 540 852

Table 7.18: Motion estimation results from simulation on Riverbed frame 10 and Riverbed frame 11 with kernel 5 on 8 Sleipnir cores
The last simulation was done on the riverbed video sequence and the results
can be found in table 7.18. In the simulations presented in tables 7.17 and 7.18
the best runtime for one Sleipnir block execution was 19 884 cycles and the worst
was 44 739 cycles. The amount of data sent to the Sleipnir blocks was
(7 670/13) ∗ ((15 ∗ 3 + 13) ∗ 16) = 547 520 vectors (8.35 MByte), and
7 670 ∗ 33 = 253 110 vectors (1.99 MByte) were copied back from the blocks. The
prolog cost for block 5 is 46 cycles and the intermission cycle cost is 43 cycles,
the same as in block 4. The epilog of block 5 is 185 cycles and comes from the
time it takes to save the motion compensated residue to local vector memory.
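The data volumes above follow directly from the parameters in the text. A small sketch of the arithmetic, taking the grouping of the search-area expression as written (590 columns of 13 macroblocks, each needing (15 ∗ 3 + 13) ∗ 16 vectors of search data, and 33 result vectors per macroblock):

```python
# Kernel 5 data volumes, recomputed from the expression quoted in the text.
macroblocks = 7_670      # macroblocks processed by kernel 5
column_height = 13       # macroblocks per column

vectors_in = (macroblocks // column_height) * ((15 * 3 + 13) * 16)
vectors_out = macroblocks * 33

print(vectors_in)   # 547520 vectors sent to the Sleipnir blocks
print(vectors_out)  # 253110 vectors copied back
```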
Block 5        Cost
PM             574 instructions
CM             64 vectors
LVM 0          26 vectors
LVM 1          1411 vectors
copy LVM→VR    34 instructions
copy CM→VR     23 instructions

Table 7.19: Kernel 5 costs
Analysis
The difference from block 4 can easily be seen in table 7.19, where the memory
cost has increased considerably due to the extra vectors needed for storing the
motion compensated residues. This also means extra data has to be copied back
to main memory, which increases the runtime idle. The differences can be seen by
comparing table 7.12 and table 7.18. As mentioned at the beginning of the chapter,
the cycle cost for kernel 5 is not based on complete full HD frames.
Equation (7.2) is a calculated approximation of the increased cost when the input
data is a complete full HD frame.
Number of MB = 8 100
Number of MB in kernel 5 = 7 670
Pinc = 8 100 / 7 670 = 1.06
Total cycle cost = 2 540 852 ∗ Pinc = 2 693 304          (7.2)
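The extrapolation in equation (7.2) can be sketched as below. Note that the equation rounds Pinc to 1.06 before multiplying, giving 2 693 304 cycles; the unrounded ratio gives a slightly lower estimate:

```python
# Full-frame extrapolation of the kernel 5 cycle cost, equation (7.2).
mb_per_frame = 8_100         # macroblocks in a full HD frame
mb_in_kernel5 = 7_670        # macroblocks actually processed by kernel 5
measured_cycles = 2_540_852  # master total cycles, table 7.18

p_inc = mb_per_frame / mb_in_kernel5            # ~1.056, printed as 1.06
full_frame_cycles = round(measured_cycles * p_inc)
print(full_frame_cycles)  # 2683299 with the unrounded ratio
```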
The number of copy instructions from one of the LVMs and from the CM
to the Vector Register (VR) are listed in table 7.19. These copies do not add
any computational functionality and are therefore not desirable. For block 5 these
numbers are rather low, which indicates that not much unnecessary copying
is done. Some of these copies are used to speed up the block: for example, by
pre-loading a value into the vector register instead of reading it from the CM,
the instruction using it will finish faster.
7.1 Motion Estimation
7.1.6 Master Code
The master code is used when testing the motion estimation blocks, i.e. block 1,
block 2, block 3, block 4 and block 5.
Program Memory Costs
The master code used in kernels 1, 2, 3, 4 and 5 is slightly different but has the
same size. The simulations done with 1, 2, 4 and 8 Sleipnirs differ somewhat in
code size, caused by the removal of code only needed when keeping track of more
Sleipnirs. The DMA firmware that was used is not included in the statistics for
the master; instead it gets a row of its own because it is included in all the kernels.
Description               Code size   RAM   ROM
Master with 1 Sleipnir    326         58    16
Master with 2 Sleipnirs   363         60    16
Master with 4 Sleipnirs   437         64    16
Master with 8 Sleipnirs   585         72    16
DMA Firmware              272         0     0

Table 7.20: Master code cost
In table 7.20 the column Code size is measured in number of instructions and
the columns RAM and ROM are measured in words. Table 7.20 shows that the DMA
Firmware does not use any memory. That is not really the case; the memory
cost has instead been included in the master code costs so that it is not counted
twice. It is worth mentioning that the DMA firmware was not written by the authors
and therefore got its own row for the cost. The ROM contains information for the
DMA pointing to the addresses where the program memory and constant memory
for the Sleipnirs are stored. The instruction cost of the master has not been a
target for optimization and therefore leaves room for improvement. The
reason for not optimizing the cost or the code is that it will not be used in exactly
this way in a complete encoder. In main memory, data for 2 complete full
HD frames with 4:2:0 sampling is allocated, along with space the size of the
Sleipnirs' program memory and constant memory. The master also allocates memory
for the resulting motion vectors and motion compensated residue blocks.
Prolog and Epilog
Before any calculations can be done, the environment has to be configured in the
processor. Table 7.21 lists this cost. The cycle count of the long prolog
includes configuration of the stack pointer, interrupt handling, setup of registers,
programming of the Sleipnir cores' PM and CM, and data copying to LVM. It can
also be described as the cycle count until the first Sleipnir core is started. The
short prolog is the same as the long prolog except that the data copying to LVM
has been excluded.
70
Results and Analysis
Task           Cycles
Prolog short   929
Prolog long    38 262
Epilog         277

Table 7.21: Prolog and epilog cycle costs
When all calculations have been performed an epilog is initiated. This epilog
finalizes the kernel; in this case it empties the last Sleipnir core of calculation
results. The cycle count of waiting for the last Sleipnir to finish has not been
included because it depends on which data the calculations are performed upon.
Video Sequence    Epilog cycles
Blue Sky          25 955
Sunflower         23 169
Pedestrian Area   22 350
Riverbed          31 969

Table 7.22: Simulated epilog cycle cost including waiting for last Sleipnir to finish
The epilog cycle costs including waiting for the last Sleipnir to finish have been
measured and the results are presented in table 7.22. Results from the 4 different
video sequences can of course be worse than in table 7.22, but the numbers give a
better understanding of the cycle cost. All results are from simulations with 8
Sleipnir cores. As can be seen in table 7.22, the riverbed simulation had the longest
epilog execution time. This can vary and is not necessarily related to the overall
computational load of the frame. The last parts to be motion estimated are in the
down-right corner of the frame. One of these columns of 13 macroblocks will likely
be the last column a Sleipnir has to process, and this is the Sleipnir that the
master has to wait for. If there is a lot of motion in the down-right corner of the
frame, the calculation of the motion vectors will need more cycles to finish and
the epilog will therefore cost more cycles.
DMA
To initiate a DMA transfer, the DMA module needs to be configured and the
transfer needs to be started. The DMA firmware provides subroutines for this.
Table 7.23 lists DMA costs from kernel 5.
In table 7.23 the transfer cost for search data is 760 cycles. The observant reader
will notice that kernel 5 should only need 720 cycles to transfer 720 vectors. The
measurement is done in such a way that all extra penalties are included, which
means that the cycle costs for interrupt and return are counted. The transfer time
is therefore longer than expected.
Task                              Cycles
Loading Sleipnir PM
  Configure                       41
  Start                           39
  Transfer block 5                666
Loading Sleipnir CM
  Configure                       41
  Start                           44
  Transfer block 5                106
DMA Firmware
  Configure search data           75
  Configure results               59
  Start search data               62
  Start results                   62
  Transfer search data, block 5   760
  Transfer MB search for, block 5 250
  Transfer results, block 5       55

Table 7.23: DMA cycle costs
The costs for copying the Sleipnir PM and CM are also presented in table
7.23. These costs are included in the prolog of the program; the total cost of the
prolog can be found in table 7.21. Considering this when implementing a complete
encoder, decisions can be taken on whether to distribute the different blocks between
different cores or to load the cores with a new block between tasks.
Table 7.23 also shows three different data transfers. At least two are needed: one
for filling an LVM with data and one for emptying it. The reason for using three
different transfers was that an easier memory allocation scheme in main memory
could be used.
To gain better utilization of the Sleipnir cores, the master needs to start them
and keep them running as much as possible. One way to increase utilization is
to do as much as possible during DMA transactions. The master that was
used for simulating kernels 1 to 5 did not offer much opportunity to hide
cycles during DMA transfers. During the transfer of search data 98 cycles could
be executed, and during the transfer of the macroblocks to search for 22 cycles
could be executed. When the results are copied back to main memory, 0 cycles
could be saved.
7.1.7 Summary
The results from kernel 2 can be compared with the cycle cost of the H.264 encoder
for the STI Cell processor, which can be found in [15] and [11]. There the cycle
cost of performing a hexagon search on a macroblock of 16 × 16 pixels in a
(-15,15)x(-15,15) search area is listed. This corresponds to the same functionality
as was implemented in Sleipnir blocks 1 through 4. The listed cycle costs for the
best and worst case searches are 1 451 and 3 609 cycles respectively.
These results can be compared to the best and worst runtimes for kernel 2
running motion estimation on a macroblock of 16 × 16 pixels in a (-15,15)x(-15,15)
search area for the riverbed video sequence. Kernel 2 is used since each search can
be measured separately while the functionality is still the same. The best and
worst runtimes were 986 and 5 348 cycles. This shows that the best case runtime
is substantially shorter for the ePUMA implementation, while the worst case is
substantially better for the STI Cell implementation. Block 2 still offers room for
improvement: the low best case runtime shows that the search and its overhead
could be optimized further for long searches to reach better worst case performance.
Scalability
[Figure: total master cycle count (y-axis, 0 to 40 000 000 cycles) plotted against the number of Sleipnir cores (x-axis, 1 to 8) for kernels 1 through 5]
Figure 7.1: Cycle scaling from 1 to 8 Sleipnir cores for simulation of riverbed
Figure 7.1 shows a graph of the scaling from 1 core to 8 cores for the 5 different
kernels. It can be seen that the scaling is almost linear in the simulations with
kernels 4 and 5. This shows that the master can fully utilize the extra cores to
speed up calculations. The simulation results that the graph is created from can
be found in appendix B, together with simulation results from pedestrian area,
sunflower and blue sky. The reason for the better scaling of kernels 4 and 5 is
that more calculations, and therefore more cycles, are performed during each
execution of a Sleipnir core. By spending more cycles in the kernel, the master
has more time to provide the other 7 cores with data between two executions of a
Sleipnir core. The processor is utilized best when all Sleipnir cores are running
simultaneously. The easiest way to increase speed further would be either to
optimize kernel 5 even more or to write a master that utilizes the third local
vector memory that is connected to the DMA. Utilizing this LVM would make it
possible to hide more DMA cycles and thereby increase the utilization of the
Sleipnir cores, resulting in a faster total execution time.
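The near-linear scaling can also be read directly from the riverbed tables: doubling the core count from 4 to 8 almost halves the master cycle count. A small sketch using the totals from tables 7.17 and 7.18:

```python
# Parallel speedup of kernel 5 on riverbed, 4 cores vs 8 cores.
cycles_4_cores = 4_975_286  # master total, table 7.17
cycles_8_cores = 2_540_852  # master total, table 7.18

speedup = cycles_4_cores / cycles_8_cores  # 2.0 would be perfectly linear
print(round(speedup, 2))  # 1.96
```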
Energy Reduction Results
Figure 7.2: Frame 10 from Pedestrian Area video sequence
Figure 7.3: Difference between frame 10 and frame 11 in Pedestrian Area video
sequence
To see the real difference between ordinary residue calculation and motion
compensated residue calculation, both will be presented as well as the differences
between them. Figure 7.2 is frame 10 from the pedestrian area video sequence and
shows the back of a person in the center of the image and a lot of moving people
on the street in the background. Figure 7.3 presents the residue between frame 10
and frame 11. The white areas in the picture indicate big differences between the
two frames; darker areas indicate better matching between the two frames.
Figure 7.4: Motion vector field calculated by kernel 5 on frame 10 and 11 of the
Pedestrian Area video sequence
Figure 7.5: Difference between frame 10 and frame 11 in Pedestrian Area video
sequence using motion compensation
In figure 7.4 the calculated motion vectors are shown. The motion vectors
illustrate how the macroblocks have been estimated to move. Areas that move
very little or not at all have short or no motion vectors, making the area very
bright, while areas with longer motion vectors show up darker. The effect of
motion compensation can be seen in figure 7.5. More dark areas and fewer bright
areas are visible, which indicates that the residue is smaller and will need less
space after compression.
The improvement can not only be seen but also proven by the numbers. Using
equation (2.4) and equation (2.5) introduced in section 2.6, the Peak Signal to
Noise Ratio (PSNR) and Mean Square Error (MSE) can be calculated. The residue
in figure 7.3 has an MSE of 313 and a PSNR of 23.16 dB. The motion compensated
residue, illustrated in figure 7.5, has an MSE of 51 and a PSNR of 31.06 dB, which
is a significant improvement.
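Equations (2.4) and (2.5) are given in section 2.6 and not repeated here, so the sketch below assumes the conventional definitions of MSE and PSNR for 8-bit samples (peak value 255); with those definitions, an MSE of 51 gives the 31.06 dB quoted above:

```python
import math

# Conventional MSE and PSNR for 8-bit samples (assumed to match the
# definitions in equations (2.4) and (2.5)). Frames are flat sequences
# of luma samples of equal length.
def mse(frame_a, frame_b):
    return sum((a - b) ** 2 for a, b in zip(frame_a, frame_b)) / len(frame_a)

def psnr_db(mse_value):
    return 10.0 * math.log10(255.0 ** 2 / mse_value)

print(round(psnr_db(51), 2))  # 31.06
```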
7.2 Transform and Quantization
This section presents the results from the Sleipnir blocks DCT with quantization
and IDCT with rescaling.
Results
Two 4x4 DCT + quantization   Cost
PM                           54 instructions
CM                           12 vectors
LVM0                         8 vectors
LVM1                         5 vectors
copy LVM→VR                  4 instructions
copy CM→VR                   0 instructions

Two 4x4 IDCT + rescale       Cost
PM                           51 instructions
CM                           12 vectors
LVM0                         8 vectors
LVM1                         5 vectors
copy LVM→VR                  0 instructions
copy CM→VR                   0 instructions

Table 7.24: Costs for DCT with quantization block and IDCT with rescaling block
The results in table 7.24 are for a fixed QP value of 10. Using a fixed value
of QP gives a fast execution time and low program and constant memory costs.
If a variable QP were desired, some extra program memory instructions would be
needed, plus at least three extra vectors in the constant memory for every
additional QP. The cycle cost for running two 4x4 pixel blocks through the DCT
with quantization block is 72 cycles. The cycle cost for running two 4x4 pixel
blocks through the IDCT with rescaling block is 69 cycles. This means that if fed
with enough data, the DCT with quantization block could transform and quantize
one full HD frame in 72 ∗ (16/2) ∗ 8 100 = 4 665 600 cycles, as calculated
according to equation (7.3).
total_cycles = (cycles ∗ 4x4_blocks_per_MB ∗ MBs_per_frame) / blocks_calculated_per_execution          (7.3)
The same calculation for the IDCT with rescaling block gives 69 ∗ (16/2) ∗
8 100 = 4 471 200 cycles. These numbers constitute a lower limit on the execution
cycles needed using the current blocks, but still give an approximation of the cycle
costs. A cost not far from these could be achieved if each block had a dedicated
Sleipnir core that is kept well fed with input data.
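The two per-frame estimates above follow from equation (7.3); a small sketch, assuming 16 luma 4x4 sub-blocks per macroblock as the (16/2) factor implies:

```python
# Equation (7.3): per-frame cycle estimate for the transform blocks.
def frame_cycles(cycles_per_exec, blocks_per_exec=2,
                 blocks_per_mb=16, mbs_per_frame=8_100):
    return cycles_per_exec * blocks_per_mb * mbs_per_frame // blocks_per_exec

print(frame_cycles(72))  # DCT + quantization: 4665600 cycles per frame
print(frame_cycles(69))  # IDCT + rescale:     4471200 cycles per frame
```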
These results can be compared with the cycle costs for the STI Cell processor
found in [15] and [11], where the cycle costs of performing DCT, IDCT,
quantization and rescaling are presented. The listed cost to perform DCT on 4
blocks of 4 × 4 pixels is 100 cycles and the cost to perform quantization of 2 blocks
of 4 × 4 pixels is 96 cycles. The total cost for running four blocks through DCT and
quantization can therefore be summed up as DCT + Quant ∗ 2 = 100 + (96 ∗ 2) = 292
cycles. In the ePUMA case, the cycle cost of performing DCT and quantization of
4 blocks of 4 × 4 pixels equals the cost of running the DCT with quantization
block twice. This gives 2 ∗ 72 = 144 cycles, which is only 49.3% of the cycle cost
listed for the STI Cell implementation. The listed cost to perform IDCT and
rescaling of 4 blocks in the STI Cell is 2 ∗ Dequant + IDCT = (2 ∗ 88) + 96 = 272
cycles for a QP value less than 24. The cost for performing the same operations in
the ePUMA equals the cost of running the IDCT with rescaling block twice, which
sums to 2 ∗ 69 = 138 cycles, only 50.7% of the cycle cost listed for the STI Cell.
Analysis
In table 7.24 it can be seen that the DCT and IDCT Sleipnir blocks are fairly
small. It would be possible to make them even smaller and faster by adding more
complex instructions. ePUMA has hardware support for doing a radix-4 butterfly
in one instruction; modifying this instruction for the purpose of the H.264
integer DCT would increase the computation speed of the DCT and IDCT. As
seen in both kernels 1 and 2 for motion estimation, utilization became low when
using multiple Sleipnir cores because the master could not provide data at a
satisfactory rate. The solution was to add support for calculating motion vectors
for more macroblocks in a single execution, a technique that would of course be
beneficial in the DCT and IDCT blocks too.
If even more performance is needed, an increase in the number of vector registers
could be a solution. This would make it possible to replace some nop instructions
with calculating instructions, which would increase the throughput. This solution
would be possible even without the complex instruction proposed above.
The number of copy instructions from one of the LVMs and from the CM to
the Vector Register (VR) are listed in table 7.24. These copies do not add any
computational functionality and are therefore not desirable. For the DCT with
quantization and IDCT with rescaling blocks these numbers are very low, which
indicates that very little unnecessary copying is done.
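For reference, the core of what these blocks compute is the standard H.264 4x4 forward integer transform W = Cf X CfT. The sketch below is a plain scalar version of that transform only; it is not the vectorized Sleipnir implementation, and the quantization and scaling stages are omitted:

```python
# H.264 4x4 forward core transform W = Cf * X * Cf^T (scalar reference;
# quantization and the scaling H.264 folds into it are omitted).
CF = [[1,  1,  1,  1],
      [2,  1, -1, -2],
      [1, -1, -1,  1],
      [1, -2,  2, -1]]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def forward_transform(x):
    cf_t = [list(row) for row in zip(*CF)]  # transpose of CF
    return matmul(matmul(CF, x), cf_t)

# A flat 4x4 block of value 5 transforms to a single DC coefficient:
coeffs = forward_transform([[5] * 4 for _ in range(4)])
print(coeffs[0][0])  # 80, all other coefficients are 0
```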
Chapter 8
Discussion
This chapter discusses the ePUMA architecture, possible hardware improvements,
and considerations when programming the processor.
8.1 DMA
H.264 is a very memory demanding compression algorithm, which is one of the
reasons why a real-time encoder is a challenge. The master that was implemented
used a pre-developed DMA firmware which provided functions to configure and
start all different kinds of DMA tasks, including one- and two-dimensional memory
transfers, broadcasts and Sleipnir to Sleipnir transfers. This makes the firmware a
target for optimization in an encoder. The time it took to configure and start a
new DMA task could in some cases become a problem when a small chunk of data
was to be copied. A message box for these kinds of small transfers is therefore
useful. When the thesis was started this message system was not available in the
simulator and it has therefore not been used.
8.2 Main Memory
The simulator version that was used when implementing the kernels and Sleipnir
blocks did not take the latency of the off-chip main memory into account. The
latency was set to the same access time as the local storage, which means that the
results in reality would likely be worse. There are techniques to reduce this off-chip
latency, but these delays were not considered in the results as the simulator does
not support simulating them.
8.3 Program Memory
For block 5 the program memory size became a problem. The problem is not
critical because the size of the program memory has not yet been decided. The
aim of the thesis was to test the ePUMA architecture's capabilities in terms of
real-time encoding, and therefore speed was more important. Shrinking the block
could have an impact on its performance, but that is not necessarily the case.
With techniques such as register forwarding, instructions could be removed and
the block would shrink.
8.4 Constant Memory
The constant memory can only address a complete vector at a time. This can
result in problems when a program uses a lot of constants, since each constant
will occupy an entire vector of 128 bits even though only one word is needed.
There are two solutions to this problem. The first is to change the addressing of
the CM to be able to address words separately, just like in the LVMs. The second
is to make it possible for some instructions to accept an immediate operand.
8.5 Vector Register File
In chapter 7 it was mentioned that better performance of, for example, the DCT
could be achieved if the vector register file were increased. The vector register file
is of course very expensive considering it must be a multiport memory, and
increasing its size will increase the total size of the core. The penalty of the small
register file is paid in program memory size in the form of nop instructions, and
in extra cycles. Considering the length of the pipeline, with its 15 stages, the
number of vector registers seems low. With more vector registers, better
instruction pipelining could be achieved. Only a few extra vector registers, maybe
as few as one or two, could give a substantial performance gain for applications
where all 8 vector registers are currently used and a single extra register could
save many inserted nop instructions. The extra hardware cost could be paid for by
reducing the size of the LVMs; this is easy to see considering that at most only
around 1400 vectors were used in one LVM. It might also be a bad idea considering
that the processor will be used for other applications which may have more use for
a large LVM. The DCT is one example where using the entire LVM would be
efficient if the DMA transfers can be hidden.
8.6 Register Forwarding
A lot of nop instructions were used in the blocks due to data dependency problems.
This could be solved with register forwarding, which will increase the hardware
cost and the complexity of the core but can give a good boost in performance.
8.7 New Instructions
A lot of new instructions were added during development of the blocks. This was
mainly because they were mentioned in the instruction set but not implemented
in the simulator. Some application specific instructions were also added that did
not exist in the instruction set from the beginning.
8.7.1 SAD Calculations
The newly added instructions HVBSUBWA, HVBSUBNA, HVBSUMABSDWA
and HVBSUMABSDNA will cost some extra hardware. The extra cost comes
from support for byte operand selection in the low half vector.
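The operation these instructions accelerate can be written in scalar form as below. This is a reference sketch of the SAD arithmetic only; the half-vector byte operand selection that the hardware performs is not modelled:

```python
# Scalar reference of the Sum of Absolute Differences over a 16x16
# macroblock (the arithmetic behind HVBSUMABSDWA/HVBSUMABSDNA).
def sad_16x16(current, reference):
    return sum(abs(current[y][x] - reference[y][x])
               for y in range(16) for x in range(16))

a = [[10] * 16 for _ in range(16)]
b = [[11] * 16 for _ in range(16)]
print(sad_16x16(a, b))  # 256: every sample differs by 1
```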
8.7.2 Call and Return
Call and return instructions were also proposed. This proposal did not offer an
unlimited stack size. Not having a stack pointer in memory offers some benefits,
mainly that access to the LVMs is not needed. The call and return instructions
could therefore be executed already in stage 2 of the pipeline, so they only have
1 delay slot and at most 1 cycle is wasted when call or return is used.
8.8 Master and Sleipnir Core
A topic that came up during implementation was whether the master core was too
slow to provide all the Sleipnirs with data. When using kernels 1 and 2 this was a
problem because the blocks finished execution too fast compared to how long it
took the master to provide data to all 8 Sleipnirs. The problem became smaller
when the workload of the Sleipnir cores was increased. The question still stands,
since the workload of the master will be much higher when a complete encoder is
implemented. By utilizing the 3 LVMs better, with DMA pre-loading of data into
the idling LVM, a lot of DMA cycles can be hidden and a better utilization of the
Sleipnir cores can be achieved. This enables the master to spread its workload over
more cycles, but it will still have to complete a lot of work.
Some of the problems were solved by making the Sleipnir blocks smarter. This
of course impacts the complexity of the hardware in the Sleipnir core. If the blocks
were more straightforward calculation blocks, the master would have to take over
the decision making of 8 Sleipnir cores. In the case of the motion estimation, where
a search takes (7 + (3 ∗ n) + 4) decisions, n being the number of points searched,
each search would on average result in 8 times that many decisions for the master
to make. The result would also have to be copied to and from the cores each time.
With the cycle cost of the DMA start and configure in mind, this solution grows
fast in theoretical cycle cost. The message box system could of course be utilized,
making the DMA cost disappear, but the decisions would still have to be made
and the evaluation of the messages would need some calculation time.
Another thing to consider is the data copying back and forth between the LVMs
and main memory. This could be avoided if the Sleipnir program memory were
reprogrammed instead, leaving the large amount of data intact in the LVMs.
Reprogramming the Sleipnir PM and CM with another block's functionality is
relatively cheap compared to a DMA transfer of one full LVM. This can not be
done on a complete frame because the LVM is too small, but if the frames are
divided into slices of suitable size, a frame could be encoded in a number of stages
where each slice is a stage. This approach could save a lot of DMA transfers.
8.9 ePUMA H.264 Encoding Performance
The thesis aimed to evaluate how much capability the ePUMA processor offers
for H.264 encoding. The results from motion estimation including motion
compensation show that it can be done in less than 3 Mega cycles on 8 Sleipnir
cores. Motion estimation is estimated to consume from 60% to 80% of the total
encoding time [15]. There are some differences between the simplified motion
estimation performed in this thesis and the motion estimation in that paper;
block 5 does not include motion estimation for blocks smaller than 16x16 pixels,
so this estimate of the share of encoding time might be misleading in this case.
Estimating the DCT's total cycle consumption when using 8 cores, it can be seen
that it is a lot smaller than the cycle consumption of motion estimation. This
tells us that the motion estimation still consumes a major part of the encoding
time and will likely have the majority of the Sleipnir cores dedicated to it if the
tasks of a complete encoder are divided amongst the Sleipnirs.
8.10 ePUMA Advantages
When programming the ePUMA architecture some features were used more than
others; they are presented in this section. The possibility of conditional execution
on every instruction in the Sleipnir core was used in a number of places. The
sections of code where it was used saved some instructions of memory and also
execution time.
Permutation of vectors and memory was another feature that was used, when
transposing matrices in the DCT and quantization block, and it made the
execution of the block significantly faster.
The pipeline offers two stages of ALUs, which made it possible to calculate
the sum of 8 words in one instruction. This was used in the calculation of the
SAD in the motion estimation block, and also in the proposed complex instructions.
8.11 Observations
The DCT task cycle length is proportional to the amount of data that is copied to
the LVMs. As can be seen in figure 8.1, the DCT does not need the same advanced
task scheduling setup as the ME needed, which can be seen in figure 6.9.
[Figure: timeline of the Sleipnir DCT execution; tasks (Task 0 through Task 122) are streamed from main memory to Sleipnir cores 0 through 4 while finished results (Result 0, Result 1, Result 2, ...) are written back to main memory]
Figure 8.1: Sleipnir core DCT task partitioning and synchronization
[Figure: two local vector memory layouts over the banks B1 through B8, holding the samples M, A-H and I-L of a half macroblock; in the right-hand layout each vector is displaced one word (X marks padding), so the column I, J, K, L can be copied in one access]
Figure 8.2: Memory allocation of macroblock in LVM for intra coding
Due to the time constraint some parts of the encoder were left out. One of
these parts was intra coding, i.e. coding of I-frames and I-slices. This is a very
interesting part of encoding and will probably need some computation power. A
similar problem has already been solved on the STI Cell processor in [13]; the
partitioning of the frame done there could be adapted to the ePUMA memory
size, though the slices would probably have to be slightly smaller.
The problem that has to be solved is the memory allocation when calculating
the intra prediction. A memory allocation mapping that could be used can be seen
in figure 8.2. As can be seen, the memory is displaced one word for every vector.
This gives the opportunity, when intra coding, to copy data to a new memory and
permute it in two instructions. The drawback is that more memory has to be
allocated. The figure describes half a macroblock residing in an LVM. On the
right side of the separator the memory is displaced one word, to be able to copy a
column of data, namely I, J, K, L. B1, B2, B3 and so on are the 8 memory banks
in the LVM.
Another part that was left out was the implementation of the deblocking filter.
The function of the deblocking filter is described in section 3.6. This task is
complex due to a lot of data dependencies. Work has been done on the STI Cell
processor in [11] which may be worth looking into; the technique of using a wave
front when applying the filter is a good idea and will work on the ePUMA as well.
To solve the memory allocation problem when applying the deblocking filter,
a similar displacement of vectors as in figure 8.2 can be used. This makes it
possible to read samples both row wise and column wise in one instruction each,
after which the filter can be applied.
Chapter 9
Conclusions and Future Work
This chapter gives the conclusions of the work done in this thesis and proposes the
future work that could be done in the area.
9.1
Conclusions
In this thesis, selected parts of a video encoder using the H.264 standard were
implemented and benchmarked using the ePUMA system simulator. The parts
focused upon were motion estimation, motion compensation, the DCT, the IDCT,
quantization and rescaling. The answers to the questions at issue from section 5.1.1
are presented here.
Is it possible to perform real-time full HD video encoding at 30 FPS
using the H.264 standard in the ePUMA processor?
This is not a simple question, and it cannot be fully answered from the work
done in this thesis. From the results obtained it can be seen that a simplified
version of motion estimation and compensation can be performed on one of
the full HD frames from the riverbed video sequence in slightly less than 5 Mega
cycles on 4 Sleipnir cores. Whether this is good enough for real-time encoding at 30
frames per second depends on the clock frequency of the processor. The cycle cost
per frame for the DCT-with-quantization and IDCT-with-rescaling Sleipnir blocks
was approximated to less than 5 Mega cycles each when running on one Sleipnir
core. This means that motion estimation and compensation will still account for
the larger part of the calculations for complex video sequences. Motion estimation
and compensation are not performed when a frame is chosen to be intra coded,
which means a future intra coding block could use the 4 Sleipnir cores otherwise
occupied by the motion estimation and compensation block. Such a setup would
leave 2 Sleipnir cores to handle the deblocking filter.
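The real-time question reduces to simple arithmetic once the per-frame cycle cost is fixed. A back-of-the-envelope check, assuming the roughly 5 Mega cycle worst case observed for motion estimation and compensation on 4 cores:

```python
# Minimum clock frequency needed for real-time encoding, given the measured
# worst-case motion estimation/compensation cost per full HD frame. Since
# the transform and deblocking stages run on disjoint core groups, each
# stage only has to meet this budget independently (our assumption).

cycles_per_frame = 5_000_000   # ~worst observed ME/MC cost on 4 Sleipnir cores
fps = 30                       # target frame rate

required_clock_hz = cycles_per_frame * fps
assert required_clock_hz == 150_000_000   # i.e. a 150 MHz Sleipnir clock
```

So any clock at or above roughly 150 MHz would make the measured worst case real-time capable, under these assumptions.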
Would it be possible to modify the processor architecture to reach better performance and if so, would it be worth the cost of the potentially
added hardware?
During the thesis work, a few ideas for hardware improvements came up. The
first is functionality for call and return control instructions using a small hardware
stack. This enables quick calls and returns without having to implement a software
stack in memory. For most applications a stack of only a few levels is enough, and
the speedup gained from such a small hardware cost makes it well worth it.
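The proposed fixed-depth return stack can be modelled as follows. This is an illustrative sketch only; the class name, depth, and overflow behaviour are our assumptions, not part of any ePUMA specification:

```python
# Illustrative model of a small fixed-depth hardware return stack. A CALL
# pushes the return address and a RET pops it; exceeding the fixed depth
# is an error (real hardware might fault or silently wrap instead).

class HwReturnStack:
    def __init__(self, depth: int = 4):
        self.depth = depth   # number of hardware slots
        self.slots = []      # stored return addresses, top of stack last

    def call(self, return_addr: int) -> None:
        if len(self.slots) == self.depth:
            raise OverflowError("hardware stack depth exceeded")
        self.slots.append(return_addr)

    def ret(self) -> int:
        return self.slots.pop()

# Two nested calls return in LIFO order:
stack = HwReturnStack(depth=4)
stack.call(0x100)        # outer call site
stack.call(0x204)        # nested call site
assert stack.ret() == 0x204
assert stack.ret() == 0x100
```

The point of the fixed depth is that the whole structure fits in a handful of registers, which is why the hardware cost stays small.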
The second piece of added functionality would be byte selection from vector
operands, which would be used in the proposed instructions depicted in appendix A.
What are the cycle costs compared to the STI Cell H.264 encoder?
The DCT and quantization and IDCT and rescaling Sleipnir blocks implemented
on the ePUMA both have a cycle cost of about 50% compared to the implementations on the STI Cell processor. The simplified motion estimation and compensation kernel implemented has a better best case and worse worst case cycle cost
compared to the STI Cell implementation.
9.2
Future Work
There are many more interesting parts to work on in this area.
One is to investigate whether additional motion vectors should be calculated in
the motion estimation and compensation Sleipnir blocks for a greater performance
gain, and whether a more square frame partitioning would be beneficial.
Functionality for performing motion estimation and motion compensation on
complete frames, including the edges, would also be interesting to implement, to
see how well the message boxes of the Sleipnir cores can handle the special cases
that arise.
An intra prediction Sleipnir block could be developed to investigate the
computational complexity of that part of an encoder and how well the task can be
parallelized on the ePUMA architecture.
It would also be interesting to see whether the deblocking filter can be executed
on two Sleipnir cores while consuming fewer cycles than motion estimation executed
on four Sleipnir cores.
Another task needed for a final ePUMA H.264 encoder would be to investigate
what overhead and additional information has to be appended to each macroblock
or frame.
The final step would be to make a complete encoder by writing master code
that can use and coordinate all the Sleipnir blocks.
Bibliography
[1] 1080p test sequences.
ftp://ftp.ldv.e-technik.tu-muenchen.de/pub/test_sequences/1080p/,
May 2010.
[2] ePUMA research team in Div of Computer Engineering. ePUMA Platform
Hardware Architecture. February 2010.
[3] ePUMA research team in Div of Computer Engineering. ePUMA simulator
manual. March 2010.
[4] ePUMA research team in Div of Computer Engineering. Sleipnir Instruction
Set Manual. April 2010.
[5] P. List, A. Joch, J. Lainema, G. Bjøntegaard, and M. Karczewicz. Adaptive
deblocking filter. IEEE Transactions on Circuits and Systems for Video
Technology, July 2003. Vol. 13, No. 7.
[6] H. Malvar, A. Hallapuro, M. Karczewicz, and L. Kerofsky. Low-complexity
transform and quantization in H.264/AVC. IEEE Transactions on Circuits and
Systems for Video Technology, July 2003. Vol. 13, No. 7.
[7] D. Marpe, H. Schwarz, and T. Wiegand. Context-based adaptive binary
arithmetic coding in the H.264/AVC video compression standard. IEEE
Transactions on Circuits and Systems for Video Technology, July 2003. Vol. 13,
No. 7.
[8] Div of Computer Engineering at LiU. Senior assembler and simulator user
manual. http://www.da.isy.liu.se/courses/tsea26/labs/2009/senior_assembler_and_simulator.pdf,
September 2008.
[9] Div of Computer Engineering at LiU. Senior instruction set manual.
http://www.da.isy.liu.se/courses/tsea26/labs/2009/senior_instruction_set_manual.pdf,
September 2008.
[10] Iain E. G. Richardson. H.264 and MPEG-4 Video Compression. Wiley, 2003.
ISBN 0-470-84837-5.
[11] Lim Boon Shyang. A simplified high definition video encoder based on the
STI Cell multiprocessor. Master’s thesis, Linköpings Tekniska Högskola, January
2007.
[12] International Telecommunication Union. H.264: Advanced video coding for
generic audiovisual services. Technical report, ITU-T, 2009.
[13] Zhengzhe Wei. H.264 baseline real-time high definition encoder on Cell.
Master’s thesis.
[14] T. Wiegand, G. J. Sullivan, G. Bjøntegaard, and A. Luthra. Overview of the
H.264/AVC video coding standard. IEEE Transactions on Circuits and Systems
for Video Technology, July 2003. Vol. 13, No. 7.
[15] D. Wu, B. Lim, J. Eilert, and D. Liu. Parallelization of high-performance video
encoding on a single-chip multiprocessor. In 2007 IEEE International
Conference on Signal Processing and Communications, pages 145–148, November
2007.
[16] C. Zhu, X. Lin, and L. Chau. Hexagon-based search pattern for fast block
motion estimation. IEEE Transactions on Circuits and Systems for Video
Technology, May 2002. Vol. 12, No. 5.
Appendix A
Proposed Instructions
[Datapath diagram: eight byte pairs selected from two 128-bit operands are subtracted, passed through ABS units, and reduced by an adder tree.]
Figure A.1: HVBSUMABSDWA
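The datapath in figure A.1 suggests a SAD-style operation: byte pairs are subtracted, passed through ABS units, and summed. A hedged Python reference model of that behaviour follows; the exact byte-selection and word-alignment rules of the proposed instruction are not reproduced here, and the function name is our own:

```python
# Hedged reference model of the SAD-style core of the proposed
# HVBSUMABSDWA instruction: eight selected byte pairs from the two
# operands are subtracted, the absolute values are taken, and the
# results are reduced to a single sum, as the adder tree in the
# figure indicates. Alignment/selection details are omitted.

def hvb_sum_abs_d(op_a: bytes, op_b: bytes) -> int:
    """Sum of absolute differences over eight selected bytes."""
    assert len(op_a) == len(op_b) == 8
    return sum(abs(a - b) for a, b in zip(op_a, op_b))

assert hvb_sum_abs_d(bytes([10, 20, 30, 40, 50, 60, 70, 80]),
                     bytes([12, 18, 30, 45, 50, 55, 70, 90])) == 24
```

An operation of this shape is the inner loop of block matching, which is why a single-instruction version would benefit the motion estimation kernels.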
[Datapath diagram: eight byte pairs selected from two 128-bit operands, with unaligned byte selection, are subtracted, passed through ABS units, and reduced by an adder tree.]
Figure A.2: HVBSUMABSDNA
[Datapath diagram: byte pairs selected from two 128-bit operands are subtracted in parallel.]
Figure A.3: HVBSUBWA
[Datapath diagram: byte pairs selected from two 128-bit operands, with unaligned byte selection, are subtracted in parallel.]
Figure A.4: HVBSUBNA
Appendix B
Results
Blue Sky
Sleipnirs   Kernel 1   Kernel 2   Kernel 3   Kernel 4   Kernel 5
1           30191409   19411816   15895187   15250896   16564840
2           15349954    9984269    7985390    7651458    8312959
4            8095485    5974027    4047430    3849476    4187640
8            5954035    5789997    2121506    1962095    2141442

Sunflower
Sleipnirs   Kernel 1   Kernel 2   Kernel 3   Kernel 4   Kernel 5
1           27821589   18200644   14692631   14048580   15424708
2           14104241    9277820    7380247    7043845    7734366
4            7397368    5742826    3726547    3544288    3889893
8            5898648    5797178    1933467    1802451    1982586

Pedestrian
Sleipnirs   Kernel 1   Kernel 2   Kernel 3   Kernel 4   Kernel 5
1           32400765   20493772   16979339   16334940   17636560
2           16398183   10491439    8523687    8182102    8832762
4            8819394    6454460    4314053    4113342    4447240
8            6154442    5825473    2335474    2101209    2283589

Riverbed
Sleipnirs   Kernel 1   Kernel 2   Kernel 3   Kernel 4   Kernel 5
1           36427689   22506736   18990971   18346788   19673956
2           18463337   11586716    9537789    9197313    9866870
4            9630235    6571084    4824777    4632170    4975286
8            6169334    5852373    2500630    2361100    2540852

Table B.1: Simulation cycle cost of motion estimation kernels
92
Copyright
The publishers will keep this document online on the Internet — or its possible replacement — for a period of 25 years from the date of publication barring
exceptional circumstances.
The online availability of the document implies a permanent permission for
anyone to read, to download, to print out single copies for his/her own use and
to use it unchanged for any non-commercial research and educational purpose.
Subsequent transfers of copyright cannot revoke this permission. All other uses of
the document are conditional on the consent of the copyright owner. The publisher
has taken technical and administrative measures to assure authenticity, security
and accessibility.
According to intellectual property law the author has the right to be mentioned
when his/her work is accessed as described above and to be protected against
infringement.
For additional information about the Linköping University Electronic Press
and its procedures for publication and for assurance of document integrity, please
refer to its www home page: http://www.ep.liu.se/
© Jonas Einemo, Magnus Lundqvist