Institutionen för systemteknik
Department of Electrical Engineering
Examensarbete
Secure Text Communication for the Tiger XS
Master's thesis carried out in Information Theory
at Linköping Institute of Technology
by
David Hertz
LITH-ISY-EX--06/3842--SE
Linköping 2006
Department of Electrical Engineering
Linköpings universitet
SE-581 83 Linköping, Sweden
Supervisor:
Tina Lindgren
isy, Linköpings universitet
Robin von Post
Sectra Communications
Examiner:
Viiveke Fåk
isy, Linköpings universitet
Linköping, 14 December, 2006
Division, Department: Division of Information Theory,
Department of Electrical Engineering, Linköpings universitet,
S-581 83 Linköping, Sweden
Date: 2006-12-14
Language: English
Report category: Examensarbete (Master's thesis)
ISBN: —
ISRN: LITH-ISY-EX--06/3842--SE
Title of series, numbering / ISSN: —
URL for electronic version:
http://www.it.isy.liu.se
http://www.ep.liu.se/2006/3842
Titel: Säkra Textmeddelanden för Tiger XS
Title: Secure Text Communication for the Tiger XS
Author: David Hertz
Abstract
The option of communicating via SMS messages can be considered available in
all GSM networks. It therefore constitutes an almost universally available method
for mobile communication.
The Tiger XS, a device for secure communication manufactured by Sectra,
is equipped with an encrypted text message transmission system. As the text
message service of this device is becoming increasingly popular, and as options to
connect the Tiger XS to computers or to a keyboard are being researched, the text
message service is in need of an upgrade.
This thesis proposes amendments to the existing protocol structure. It thoroughly
examines a number of options for source coding of small text messages and
makes recommendations as to the implementation of such features. It also suggests
security enhancements and introduces a novel form of steganography.
Keywords
source coding, text compression, cryptography, steganography, PPM
Acknowledgements
Several people have provided valuable aid in the process of working with this
thesis. I would like to thank the following people in particular:
Robin von Post for his valuable feedback and ideas as well as proofreading
assistance.
My examiner, Viiveke Fåk, for giving feedback on the work and report of this
thesis.
Andreas Tyrberg for aiding me with LaTeX typesetting issues.
Cem Göcgören, Stig Nilsson and Tina Brandt at Regeringskansliet, Rikspolisstyrelsen and MUST, respectively, for their input on the functionality of the Tiger
XS.
Christina Freyhult for providing valuable feedback.
My family for proofreading assistance.
Contents

1 Introduction
  1.1 Background
  1.2 Purpose
  1.3 Methodology
  1.4 Disposition

2 The Tiger XS
  2.1 Tiger XS
  2.2 Communication Security
    2.2.1 The Need for Communication Security
    2.2.2 Security Required
    2.2.3 Security Offered by GSM
    2.2.4 Security Offered by Tiger XS

3 Protocol
  3.1 Underlying Protocols
    3.1.1 Encrypted Short Message Protocol
    3.1.2 Carrier Protocols
  3.2 Protocol Structure
    3.2.1 Protocol Coding
    3.2.2 Message Control Protocol (MCP)
    3.2.3 Object Control Protocol (OCP)
    3.2.4 Object Specific Protocols

4 Source coding
  4.1 Purpose of This Chapter
    4.1.1 Purpose
    4.1.2 Prerequisites
    4.1.3 Limitations
  4.2 Definitions
  4.3 Source Coding Evaluation Environment
  4.4 Source Coding Basics
    4.4.1 Entropy and Source Coding
    4.4.2 Generalized Source Coding
    4.4.3 Statistics
    4.4.4 Predictors
    4.4.5 Coding
  4.5 Source Coding Using the Tiger XS
  4.6 Source Coding of Text
    4.6.1 The Entropy of Text
    4.6.2 Static and Adaptive Approaches
    4.6.3 Early Design Decisions
    4.6.4 Dictionary Techniques
    4.6.5 Predictive Techniques
    4.6.6 Preprocessing Text
    4.6.7 Lossy Text Coding
    4.6.8 Variable Algorithm Coding
    4.6.9 GSM Text Compression
    4.6.10 Other Algorithms
  4.7 Source Coding of Transmission Protocols
    4.7.1 Coding of Numeric Fields

5 Security
  5.1 Purpose of This Chapter
    5.1.1 Purpose
    5.1.2 Prerequisites
    5.1.3 Limitations
  5.2 Definitions
  5.3 Security Basics
    5.3.1 Possible Attacks
    5.3.2 Cryptography Objectives
    5.3.3 Symmetric and Asymmetric Ciphers
    5.3.4 Block Ciphers and Stream Ciphers
    5.3.5 Initialization Vectors
    5.3.6 Cryptographic Hash Algorithms and MACs
    5.3.7 Challenge-Response Schemes
  5.4 GSM Security
    5.4.1 GSM Security in Brief
    5.4.2 Vulnerabilities
  5.5 Encrypted short MeSsaGe (EMSG)
  5.6 Extending EMSG
    5.6.1 Centrally Assigned Session Keys
    5.6.2 Peer-to-Peer (P2P) Session Establishment
    5.6.3 Pre-setup Session Keys
  5.7 Source Coding and Cryptography
  5.8 Steganography Using PPM
    5.8.1 The Protocol
    5.8.2 Adapting PPM for Steganography
    5.8.3 Detecting PPM Steganography
    5.8.4 Cryptographic Aspects
    5.8.5 Practical Test

6 Conclusions, Recommendations and Further Studies
  6.1 Conclusions
    6.1.1 Communication Protocols
    6.1.2 Source Coding
    6.1.3 Security
  6.2 Recommendations
    6.2.1 Communication Protocols
    6.2.2 Source Coding
    6.2.3 Security
  6.3 Further Studies

Bibliography

A Language Reference Files
  A.1 DN-4
  A.2 CaP
  A.3 RodaRummet
  A.4 Bibeln
  A.5 Nils
  A.6 Macbeth

B Performance Reference Files
  B.1 Jordbävning
  B.2 Nelson
  B.3 Diplomatic-1
  B.4 Jalalabad
  B.5 Blair

C Source Coding Graphs
  C.1 Dictionary Techniques

D Steganographic Texts
  D.1 DN-1
  D.2 CaP-1

E Source Coding Evaluation Environment
  E.1 Screen Shots

F Acronyms
Chapter 1
Introduction
This document was written as the report of a Master of Science thesis in Applied
Physics and Electrical Engineering at the Department of Electrical Engineering at
Linköping Institute of Technology. The task was performed at Sectra Communications AB.
1.1 Background
The Tiger XS is a battery-powered, handheld device offering encrypted voice and
data services. The Tiger XS as well as other Sectra products provide means for
transmitting and receiving encrypted text messages via the GSM Short Message
Service (SMS).
Encrypted voice and data services made available using the Tiger XS are carried using a GSM Circuit Switched Data (CSD) channel. The availability of CSD
channels is limited, as not all GSM operators provide such services. In some countries CSD channels are not available at all. SMS is, however, almost universally
available and requires less configuration hassle for roaming end-users.
It is of great interest for Sectra to extend the current message channel into a
more advanced and efficient channel capable of carrying several types of content.
1.2 Purpose
The purpose of this thesis is to describe possible solutions to extend the current
message channel into a more advanced channel capable of carrying several types
of content in an efficient and secure way.
1.3 Methodology
Initially, a survey was performed to determine which additional functionality was
desired in the new protocol. Discussions were held with customers of Sectra
Communications regarding their view of the current system and their priorities in
terms of enhancements.
The following tasks were carried out:
• Design of a protocol capable of distributing interactive content of varying
types. Protocols are discussed in chapter 3.
• Design of source coding systems capable of efficiently encoding textual data
as well as other types of data. Source coding is discussed in chapter 4.
• Exploration of amendments and changes to the current encryption system in
order to adapt it to new transmission channels. Security is discussed in chapter 5.
In order to accomplish the second task, an environment for the evaluation of different source coding methods was developed. It includes full implementations of
most algorithms described in chapter 4. Performance of algorithms and methods
included in chapter 4 derives from this environment.
In order to create models for textual grammar, a set of language reference files
was assembled. These files contain written text, each file in a different language or
style, and serve as norms for written text in their respective language or style.
Furthermore, a larger set of performance test messages was derived, partly in
cooperation with customers of Sectra Communications. These two sets of files are
discussed in the source coding chapter and presented in greater detail in appendix A
and appendix B.
1.4 Disposition
Chapter 2 - Tiger XS introduces the Tiger XS.
Chapter 3 - Protocols introduces the protocols involved and suggests amendments to them.
Chapter 4 - Source Coding discusses source coding.
Chapter 5 - Security discusses security issues.
Chapter 6 - Conclusions, Recommendations and Further Studies summarizes
this thesis and makes recommendations.
Chapter 2
The Tiger XS
Secure communication is a service that is imperative for many organizations. This
chapter will introduce a cryptographic communications device central to this thesis, the Tiger XS, and briefly explain the necessity of secure communication.
2.1 Tiger XS
The function of the Tiger XS unit is to provide a simple means of telephoning and
exchanging messages confidentially.
The Tiger XS is fitted with a microphone and a telephony speaker and is capable
of source coding and ciphering voice data. It is also fitted with a joystick input
device and a screen and can be used to compose, send and receive text messages.
The Tiger XS can be connected to various communication devices
using either a Bluetooth radio connection or a serial connection. The connected
communication device is used to transmit data between users. By far the most
common setup is to connect the Tiger XS via Bluetooth to a Global System for
Mobile communication (GSM) telephone.
Voice calls made with the Tiger XS using a GSM telephone utilize the GSM
CSD connection to exchange data during the call. SMS messages are used when
transmitting text messages using GSM telephones. Figure 2.1 depicts the typical
use of the Tiger XS.
Keys needed when using the Tiger XS are provided by a separate SIM-card
that is to be removed from the unit when not in use. A picture of the Tiger XS
and accompanying key is found in figure 2.2.
2.2 Communication Security

2.2.1 The Need for Communication Security
A typical scenario where a lack of communication security could hurt a civilian
business is that involving industrial espionage during bidding processes. Sellers
Figure 2.1. Tiger XS operation using GSM
Figure 2.2. Tiger XS and access card
may have a great interest in learning of the offers and respective prices of competing
sellers. Proprietary business information, such as business strategies and research
and development results, may also be targeted. Proprietary business information
may have a great effect on stock exchange prices, and it may therefore be
of great economic value to know it in advance.
Military organizations commonly employ secure communication as they are assumed to be targeted by military espionage. The robust communication systems
used in times of war are not the only communication systems used by military
organizations. Systems that rely on civil infrastructure, such as mobile telephony
networks, may be used in peacetime. The ramifications of exposing military secrets, even in peacetime, could be severe. Such ramifications warrants that the
communication be made secure.
2.2.2 Security Required
An obvious security requirement is that the communication should be kept confidential.
The length of time during which the communication must remain confidential
is of importance, as it determines how much time, and how much benefit from
technological advances, is made available to parties attempting to gain access to
the contents of the communication.
When communicating through text messages, asserting the identity of the party
with which one communicates becomes especially significant, as text messages,
unlike voice communication, are easily forged.
2.2.3 Security Offered by GSM
Messages and calls sent over the air using GSM are encrypted. Calls and messages
are only encrypted while in transit to the GSM base station; this is important, as
anyone with access to the telephony communication at the base station or beyond
has access to the messages and calls. GSM network providers, governments, long-distance
carriers and anyone who may, by some means, gain access to one of the
mentioned parties will be able to eavesdrop on any call or message.
The lack of confidentiality when using GSM to communicate is, however, not
limited to the problem of network providers having full access to the unencrypted
data, as the encryption methods employed to protect the communication with the
base station have serious flaws. The encryption methods used in GSM and the
problems with them are described in more detail in section 5.4.
In GSM the authenticity of an SMS message is not verifiable, and the phone
number of the sender can be set to whatever the sender wishes.
2.2.4 Security Offered by Tiger XS
The Tiger XS provides confidentiality beyond what is possible in GSM by employing
end-to-end encryption. When using end-to-end encryption, only the
two communicating parties are in possession of the means necessary to encrypt and
decrypt the data sent. The messages and calls are thereby relayed by the network
providers in encrypted form, which ensures privacy given that the encryption is
not broken. In addition, messages and calls are encrypted using algorithms
that are more difficult to break than those used in GSM.
The authenticity of messages sent using the Tiger XS is verified using a shared
secret. Knowledge of this secret is assumed to be equivalent to being a trusted party.
The exact algorithms and cryptographic systems used by the Tiger XS are not
public information as exposing them would pose a threat to national security.
Details of the algorithms and cryptographic systems employed will therefore not
be included in this report. A rudimentary, more technical, description of the
security offered by the messaging system can be found in section 5.5.
Chapter 3
Protocol
This chapter describes communication protocols used by the Tiger XS for exchanging messages. It also presents a structure for extending current protocols in order
to enable distributing interactive content of varying types.
3.1 Underlying Protocols
The text message is carried using several transport protocols. The protocols of
consequence for the implementation of messaging services on the Tiger XS are
described in this section. The text message is encrypted and carried using the
EMSG protocol, which in turn is carried in an SMS. The transfer method is depicted
in figure 3.1.
(a) Encapsulation
(b) Protocol stack
Figure 3.1. The underlying protocols
3.1.1 Encrypted Short Message Protocol
The Encrypted Short Message (EMSG) protocol defines the method of short message communication between two handheld Sectra communication devices, such
as the Tiger XS. The security aspects of the EMSG protocol are discussed in section 5.5. The EMSG protocol has an overhead of 58 bytes and enables the secure
transmission of up to 65478 bytes of data.
3.1.2 Carrier Protocols
The encrypted data is carried between Tiger XS units using external devices and
standardized protocols. The most common way for encrypted text messages to
be carried between Tiger XS units is by SMS messages using a GSM telephone
connected to the Tiger XS via Bluetooth. There are, however, more options, some
of which are described below.
3.1.2.1 Short Message Service (SMS)
SMS provides a method for sending small text messages using a GSM network.
The transfer method of interest is point-to-point and defined in the GSM standard 03.40 [12]. A cell-broadcast protocol for sending messages to all subscribers
connected to a specific base station also exists, though it is of no interest for the
application discussed in this thesis.
An SMS message can contain a maximum of 140 bytes.
A UDH header can be prepended to the message. The UDH header extends
the SMS protocol, giving it the ability to encode multimedia content such
as ringtones and voice mail indications. One function provided by the UDH is
of special interest: the ability to span message contents over a maximum of 255
SMS messages, enabling the transmission of up to 34170 bytes of data. The ability
to transfer messages longer than 140 bytes is of great interest, as the
EMSG protocol has an overhead of no less than 58 bytes, giving the encrypted
payload a maximum length of 82 bytes. The structure of a single SMS message as
well as that of a concatenated SMS message is displayed in figure 3.2.
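The payload arithmetic above can be checked with a short calculation. The EMSG overhead of 58 bytes comes from the text; the 6-byte size of the concatenation UDH element is an assumption based on GSM 03.40, and all names in this sketch are illustrative:

```python
# Sketch of the payload arithmetic above. The EMSG overhead of 58 bytes
# comes from the text; the 6-byte size of the concatenation UDH element
# is an assumption based on GSM 03.40.
SMS_SIZE = 140      # maximum bytes in a single SMS
UDH_CONCAT = 6      # assumed UDH concatenation overhead per SMS part
EMSG_OVERHEAD = 58  # EMSG protocol overhead

def emsg_payload(parts: int) -> int:
    """Maximum encrypted payload when spanning an EMSG over `parts` SMS."""
    if parts == 1:
        return SMS_SIZE - EMSG_OVERHEAD              # no UDH needed
    return parts * (SMS_SIZE - UDH_CONCAT) - EMSG_OVERHEAD

print(emsg_payload(1))  # 82, as in figure 3.2(a)
print(emsg_payload(2))  # 210, as in figure 3.2(b)
```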
3.1.2.2 Short Data Service (SDS)
SDS is a text messaging service available in TErrestrial Trunked RAdio (TETRA)
networks.
An SDS message can use one of four different modes: usermode-1, usermode-2
and usermode-3 for sending messages of length 16, 32 and 64 bits respectively,
and usermode-4 for sending messages with a payload length between 0 and 2039
bits, that is, roughly 254 bytes. Only usermode-4 is of interest for the type of
messaging considered here, as the payload capabilities of the other usermodes are
insufficient.
In addition to the ability to send up to 254 bytes of data, a UDH header
defined similarly to that in GSM can be prepended. Using the UDH header, message
data can be spanned over a maximum of 255 SDS messages, giving a payload
of up to 64770 bytes. Spanning a message over several SDS messages increases the
risk of failed delivery, and spanning it over 255 different SDS messages
is hardly possible in practice. The structure of SDS messages is similar to SMS
(a) Single SMS (82 bytes total payload)
(b) Concatenated SMS using UDH (210 bytes total payload)
Figure 3.2. SMS as carrier
messages as displayed in figure 3.2 but with a total payload of 196 and 438 bytes
respectively.
3.1.2.3 Multimedia Messaging Standard (MMS)
MMS is a 3GPP-developed message system for 2.5G and 3G mobile telephony
networks. MMS messages are sent encapsulated over the WAP protocol. In 2.5G
networks MMS messages are carried over GPRS, and the service allows messages
of arbitrary length.
The use of MMS mitigates problems with limitations on message length.
Unlike its predecessor SMS, however, MMS is not universally available and may require
special calling plans and configurations.
3.2 Protocol Structure
In this section, a protocol enabling the transmission of interactive content of varying
types is described. The protocol was developed as part of this thesis and
constitutes the messaging method suggested herein.
A message is represented by one or more objects. Such objects include text
sections, multiple choice questions, contact information etc. Object data are coded
with an object specific protocol and encoding. These object specific protocols are
described in section 3.2.4.
The encoded objects are assembled into a message using the Object Control
Protocol (OCP) described in section 3.2.3, yielding a message data stream.
The message data stream is encoded using the Message Control Protocol (MCP)
which enables message content to be spanned across multiple EMSG messages.
MCP enables an interim implementation of the ability to span the message data
stream across multiple carrier messages. A more favorable implementation would
abandon MCP in favor of the equivalent functions present in the carrier-level
protocol and/or in the EMSG protocol.
The protocols are depicted in figure 3.3.
(a) Encapsulation
(b) Protocol stack
Figure 3.3. The underlying protocols
3.2.1 Protocol Coding
The following section assumes two methods of coding data: plain byte-oriented
and plain bit-oriented coding. These two coding methods, together with another
two, are presented in section 4.4.5. As the latter two methods rely on some form
of source coding, their use for coding protocol data is discussed in section 4.7.
It should be noted that the functions of the protocols are identical regardless of
coding, but the protocol data could be compressed if one of the latter two methods
is used.
3.2.2 Message Control Protocol (MCP)
The message data stream may need to be transmitted using several EMSG messages;
MCP allows this by mimicking the functionality of the UDH header found in SMS and
SDS. As noted above, the functionality provided by MCP would preferably be
implemented at the carrier protocol level.
The header is comprised of:
MessageRefNumber A message reference number used to identify message parts
belonging to the same message.
MessageMaxNumber The number of message parts in this message.
MessageSeqNumber The number of this specific message part.
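As an illustration, MCP fragmentation and reassembly might look as follows. The one-byte width of each header field is an assumption (chosen to mirror the SMS UDH concatenation element); the function names are ours, not part of the protocol:

```python
import struct

# Illustrative sketch of MCP fragmentation and reassembly. The one-byte
# width of each header field is an assumption, not fixed by the text.

def mcp_fragment(data: bytes, ref: int, part_size: int) -> list:
    """Split a message data stream into MCP parts, each prefixed with
    (MessageRefNumber, MessageMaxNumber, MessageSeqNumber)."""
    chunks = [data[i:i + part_size] for i in range(0, len(data), part_size)]
    return [struct.pack("BBB", ref, len(chunks), seq + 1) + chunk
            for seq, chunk in enumerate(chunks)]

def mcp_reassemble(parts: list) -> bytes:
    """Reorder parts by MessageSeqNumber and strip the 3-byte headers."""
    return b"".join(p[3:] for p in sorted(parts, key=lambda p: p[2]))
```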
3.2.3 Object Control Protocol (OCP)
The encoded objects are assembled into a single message using the object control
protocol. For each object to be transferred, two header fields and the actual object
data are appended to the message. This is illustrated in figure 3.4.
The data is encoded using the following three fields:
ObjectType The type of object, represented using a numeric identity.
ObjectLen Length of the ObjectData field.
ObjectData The actual object data.
Figure 3.4. OCP structure
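The three fields above form a classic type-length-value layout, which can be sketched as follows. The field widths chosen here (one byte for ObjectType, two big-endian bytes for ObjectLen) are illustrative assumptions, not fixed by the protocol description:

```python
import struct

# Sketch of OCP assembly and parsing as a type-length-value encoding.
# Field widths (one byte for ObjectType, two big-endian bytes for
# ObjectLen) are illustrative assumptions.

def ocp_assemble(objects: list) -> bytes:
    """Concatenate (ObjectType, ObjectData) pairs into a message stream."""
    msg = b""
    for obj_type, obj_data in objects:
        msg += struct.pack(">BH", obj_type, len(obj_data)) + obj_data
    return msg

def ocp_parse(msg: bytes) -> list:
    """Recover the (ObjectType, ObjectData) pairs from a message stream."""
    objects, i = [], 0
    while i < len(msg):
        obj_type, obj_len = struct.unpack_from(">BH", msg, i)
        objects.append((obj_type, msg[i + 3:i + 3 + obj_len]))
        i += 3 + obj_len
    return objects
```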
3.2.4 Object Specific Protocols
This section describes the object specific protocols. Different protocols are specified for each type of object.
3.2.4.1 Text Transfer Protocol (TTP)
The Text Transfer Protocol contains a single field indicating the encoding method
used. This may or may not be a text source coding method as described in chapter 4.
The header is comprised of:
Encoding Indicating the encoding used.
A fixed set of encodings is assumed to be agreed upon by all units prior to
use. As static source coding methods are highly language-dependent, encodings
using the same source coding methods but different statistical prerequisites, as
derived from language, may be used by indicating different values for the encoding
field. Encodings would preferably include an uncompressed 7-bit character set as
well as an uncompressed 8-bit character set.
3.2.4.2 Multiple Choice Questions Transfer Protocol (QTP)
The input devices available on the Tiger XS, a joystick and two buttons, significantly
limit the ability to input longer messages. Because of this limitation, a method
allowing mobile users to respond using one or possibly a few multiple choice
questions has been requested by customers. QTP is constructed to offer a method
for communicating a single question together with the alternatives with which the
user may respond.
The communication is composed of a QTP Request relaying the question followed by a QTP Response indicating the alternative selected by the user being
questioned. The question is identified using an identity field in conjunction with
the phone number of the sending and receiving party, respectively.
A QTP Request (QTP-REQ) is composed of the following fields:
QuestionId An identity allowing responses to be associated with their respective
questions.
QuestionLen Length of the question field.
Question The actual question.
AlternativeLen Length of Alternative field.
Alternative A response alternative.
The AlternativeLen and Alternative fields are iterated until all response alternatives are included; the end of the list is marked by setting AlternativeLen to zero.
A QTP Response (QTP-RES) is composed of the following fields:
QuestionId An identity allowing responses to be associated with their respective
questions.
SelectedAlternativeNum The index number of the selected alternative.
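The QTP-REQ layout above, including the zero-length terminator for the alternative list, can be sketched as follows. The one-byte QuestionId and length fields are illustrative assumptions:

```python
# Illustrative sketch of QTP-REQ encoding. One-byte QuestionId and
# length fields are assumptions; the field order follows the list above.

def qtp_request(question_id: int, question: bytes, alternatives: list) -> bytes:
    msg = bytes([question_id, len(question)]) + question
    for alt in alternatives:
        msg += bytes([len(alt)]) + alt   # AlternativeLen + Alternative
    return msg + b"\x00"                 # AlternativeLen = 0 ends the list
```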
3.2.4.3 Contact Information Transfer Protocol (CTP)
Functionality for keeping track of and updating users' phone books has been
requested by the users of the Tiger XS. CTP provides a simple method for
transmitting phone book entries. As entries on the Tiger XS are composed of pairs
of names and phone numbers, this format includes only that data.
Names and phone numbers are fields of varying length; fields indicating name
length and phone number length are therefore included in the header:
NameLen Length of the Name field.
Name Name of the contact.
PhoneLen Length of the Phone field.
Phone Phone number of the contact.
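A CTP entry can be encoded and decoded as in the following sketch; the one-byte length fields and the choice of character encodings are illustrative assumptions:

```python
# Illustrative sketch of a CTP phone book entry; one-byte length
# fields are an assumption, as is the choice of character encodings.

def ctp_entry(name: str, phone: str) -> bytes:
    n, p = name.encode("utf-8"), phone.encode("ascii")
    return bytes([len(n)]) + n + bytes([len(p)]) + p

def ctp_decode(data: bytes) -> tuple:
    name_len = data[0]
    name = data[1:1 + name_len].decode("utf-8")
    phone_len = data[1 + name_len]
    phone = data[2 + name_len:2 + name_len + phone_len].decode("ascii")
    return name, phone
```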
If a more advanced protocol for transmitting contact information is desired, the
recommended action would be to implement a vCard parser for sending and receiving phone book entries in the vCard format. The vCard format is standardized
by the Internet Mail Consortium and described in Request For Comments (RFC)
2425 [14] and 2426 [10], respectively.
Chapter 4
Source coding
This chapter introduces source coding concepts and presents a set of algorithms,
some of which are explored in depth and evaluated for use with the Tiger XS.
4.1 Purpose of This Chapter

4.1.1 Purpose
Source coding of the text being sent is of great interest, as there is an expressed will
to be able to send longer text messages than the currently available 80 characters.
Unlike many source coding problems, where nothing can be assumed about the
structure of the data to be encoded, this situation allows for the assumption that
the data is comprised of text. Furthermore it can be assumed that the language
of the messages to be used with the Tiger XS handset is known at the time of
the production of the unit. It is therefore also assumed that the settings of the
algorithm to be employed may be varied, depending on the exact language to be
used, to further optimize the source coding.
The methods that may be employed are constrained by the limited computational
resources available; see section 4.1.2.
The precise purpose of this chapter is to achieve these four results:
• To implement, verify and derive results for a set of source coding methods
in order to assemble a list of recommended implementations. The list shall
include several different implementations, starting with a simple implementation
and ending with the most promising implementation in terms of performance.
This approach assumes the existence of a performance/complexity trade-off,
a notion that is ever present in modern source coding literature.
• To include text examining the different tweakable aspects of the algorithms
described in detail, in order to simplify decision making when implementing
these.
• To include descriptions of algorithms that have not been tested, yet may be
of significance.
• To include rudimentary descriptions of other well known algorithms as well
as a motivation why these have not been further examined.
4.1.2 Prerequisites
Unlike many other platforms, such as PCs, the Tiger XS offers relatively limited
computational resources. Furthermore, source coding text messages is not the
prime objective of the unit, and it should be assumed that the processor time and
memory available for source coding is but a portion of the whole unit's processor
time and memory. It is assumed that algorithms of great complexity may be
unfavorable to implement. Computational prerequisites are examined in more detail
in section 4.5.
4.1.3 Limitations
Testing all available source coding algorithms in detail is beyond the scope of
this thesis. Therefore focus has been on a smaller set of algorithms that have
been deemed appropriate given the prerequisites. Closely examined algorithms
have been assumed to achieve good performance given their complexity level and
requirements in terms of computational resources.
4.2 Definitions
The following terms are used extensively throughout this chapter:
Character A single character from an alphabet that is to be source coded.
Symbol A representation of a character or metacharacter.
Symbolweight An object associating one or more symbols with specific probabilities.
Alphabet A set of characters.
Predictor A predictor predicts which symbol may appear next in a symbol stream
and with which certainty.
Codec A method of coding symbols using symbolweights.
Rate The amount of coded data generated per uncoded data. Measured in bits
per character or bits per symbol.
Entropy The amount of information contained in a dataset.
Huffman code A type of coding using a single bit-vector of variable size to represent a single or a few symbols.
Arithmetic code A type of coding using a single bit-vector of variable size to
represent all symbols contained in the message.
4.3 Source Coding Evaluation Environment
In order to evaluate the different source coding methods described in this thesis,
a tool for testing was created: the Source Coding Evaluation Environment (SCEE).
It was developed in Visual Studio using C# and the .NET Framework version 2.0.
SCEE is built around the source coding model described in section 4.4.2 and
is built for maximum “tweakability”.
The SCEE has built in tools for the following:
- Entropy Estimation
- Creation of Dictionary coding methods (see section 4.6.4) using different
methods and variables.
- Creation of PPM coding methods (see section 4.6.5.6) using different methods and variables.
- Visualization of allocation of bits when coding text.
- Assessment of Steganographic methods (see section 5.8).
- Automated performance measurements.
A small set of screenshots of the SCEE is included in appendix E.
4.4 Source Coding Basics
Source coding is commonly divided into two categories:
Lossless coding Wherein a dataset is represented with another, preferably shorter,
dataset via an injective and reversible function. The exact original data can
always be reproduced given the encoded data.
Lossy coding Wherein a dataset is represented with another, preferably shorter,
dataset via an irreversible function. The original data can generally not be
reproduced given the encoded data; however, a dataset deemed to be sufficiently
close in meaning to the original data can be inferred from the encoded
data. Lossy encoding is primarily used when encoding images and audio.
This chapter will almost exclusively deal with the former form of coding (although
the latter will be visited briefly).
4.4.1 Entropy and Source Coding
A data stream emanating from a data source is said to have an entropy, commonly
measured in bits/character. The entropy of a data stream is a measure of the
amount of information present in the data stream and thus forms a bound on the
performance of all source coding methods. Entropy of text is explored in more
detail in section 4.6.1.
Figure 4.1. The encoding process
4.4.2 Generalized Source Coding
The source coding algorithms described in this thesis can all be fitted in a common
model. This is the model implemented in the source coding evaluation environment
and this model is presented here.
The individual components in this model are discussed in section 4.4.3 through
section 4.4.5.
The following functions are carried out by the components in figure 4.1 and
figure 4.2.
SymbolTranslator Translates an object, such as a text character or a protocol field,
into a symbol, or vice versa.
Predictor Predicts which symbols will be seen next and with what probability.
See section 4.4.4.
Codec Encodes or decodes symbols using their probabilities. See section 4.4.5.
Coder Utilizes the functions of the predictor and the codec in order to encode
the symbols.
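As a rough sketch of how the components in figure 4.1 cooperate, the coder below repeatedly asks a predictor for symbol weights and hands symbol and weights to a codec; here the codec merely accumulates the ideal code length, −log2 p. The class and method names are illustrative assumptions, not the SCEE interfaces (which were written in C#):

```python
# Sketch of the generalized coding model: coder drives predictor + codec.
# Names are illustrative; the codec here only totals ideal code lengths.
from math import log2

class StaticPredictor:
    """Returns the same symbol weights regardless of history."""
    def __init__(self, weights):
        self.weights = weights            # symbol -> relative weight
    def predict(self, history):
        return self.weights

class RateCodec:
    """Toy codec: accumulates the ideal code length, -log2 p(symbol)."""
    def __init__(self):
        self.bits = 0.0
    def encode(self, symbol, weights):
        total = sum(weights.values())
        self.bits += -log2(weights[symbol] / total)

def encode(symbols, predictor, codec):
    history = []
    for s in symbols:
        codec.encode(s, predictor.predict(history))
        history.append(s)
    return codec.bits
```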
4.4.3 Statistics
Source coding exploits asymmetries in probability. In order to approximate
probabilities, statistics are used. Statistics in the source coding evaluation environment
are in the form of SymbolWeight objects, associating single symbols with a relative
weight and also providing the total weight of all symbols having weight.
Given a symbol, ai, and an alphabet, A = {a0, . . . , ar−1}, symbolweights,
w(ai), relate to probabilities, p(ai), in the following manner:

W = Σ_{i=0}^{r−1} w(ai)
Figure 4.2. The decoding process

p(ai) = w(ai) / W

4.4.4 Predictors
Predictors predict which symbol may appear next and with which certainty.
This document will treat predictors as being one of the following two types:
Static predictors Making the same predictions regardless of previously coded
symbols.
Variable predictors Making different predictions depending on previously coded
symbols.
4.4.5 Coding
Given a set of symbols there are several ways of representing those in terms of
binary data. In most cases bits are grouped together in a larger fixed-length
constellation such as a byte and most or all of those states are associated with a
single symbol. If, however, not all symbols are equifrequent, it’s possible to choose
a representation of each symbol such that on average the representation of the
symbols in terms of bits per symbol is more efficient.
The following representations will be discussed in this document:
4.4.5.1 Plain Byte-oriented Coding
Given a set of n symbols in an alphabet An = {a0, . . . , an−1}, one could map those
to k = ⌈log256 n⌉ bytes, representing a0 with the first of the 256^k states and so
forth. This representation is crude and will, in most applications¹, result in an
expansion.
This coding method is included in the source coding evaluation environment
albeit not used when measuring performance.
¹ where n ≠ 256^k
4.4.5.2 Plain Bit-oriented Coding
Given a set of n symbols in an alphabet An = {a0, . . . , an−1}, one could map those
to k = ⌈log2 n⌉ bits, representing a0 with the first of the 2^k states and so forth. This
representation is relatively simple and performs better than byte-representation in
most cases.
An implementation of this coder is present in the source coding evaluation
environment.
4.4.5.3 Variable Length Coding
To examine variable length coding, a random variable X with the outcomes defined
by the symbol alphabet Sn = {s0, . . . , sn−1} is used. X is assumed to have the
probability distribution P = {p0, . . . , pn−1} where pi = p(X = si). If coding such
a set of symbols with a variable length code consisting of sequences of bits, one
can show that on average a minimal number of bits will be needed to represent
sequences of symbols if and only if the codeword associated with the symbol si is
exactly −log2 pi bits long for all possible values of i (see [18] for proof of this).
As −log2 pi may not be an integer, one may need more bits, up to one more bit
to be precise. This gives a maximum codeword length of −log2 pi + 1 bits for si.
Such a code will have an average rate, R, measured in bits/symbol, bound by

Rmin = Σ_{i=0}^{n−1} −p(si) log2 p(si) ≤ R ≤ (Σ_{i=0}^{n−1} −p(si) log2 p(si)) + 1 = Rmax

where the entropy function, H(X) = Σ_{i=0}^{n−1} −p(si) log2 p(si), is usually used, giving

Rmin = H(X) ≤ R ≤ H(X) + 1 = Rmax
Given this, a message consisting of q symbols with the same probability distribution
could be represented with an average length, L, bound by

Lmin = qH(X) ≤ L ≤ qH(X) + q = Lmax
The lower bounds on R and L are reached if and only if all probabilities pi can
be written on the form pi = 2−k , where k is a positive integer.
A relatively simple algorithm for constructing such variable-length codes is the
Huffman algorithm, generating the so-called Huffman codes [15].
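The Huffman algorithm can be sketched as follows: the two least probable nodes are repeatedly merged, and each merge prepends one bit to the codewords below it. This is a minimal illustration, not an optimized implementation:

```python
# Sketch of Huffman code construction: repeatedly merge the two least
# probable nodes; resulting rates lie between H(X) and H(X) + 1.
import heapq

def huffman_code(freqs):
    """freqs: dict symbol -> weight. Returns dict symbol -> bit string."""
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    counter = len(heap)                     # tie-breaker for equal weights
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (w1 + w2, counter, merged))
        counter += 1
    return heap[0][2]
```

For the dyadic distribution {0.5, 0.25, 0.25} the codeword lengths are exactly −log2 pi, so the rate reaches the entropy bound.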
4.4.5.4 Composite Variable Length Coding
Given a message consisting of q symbols, M = [X0, . . . , Xq−1], one could represent
all messages of such length with a single codeword of length, L, bound by

Lmin = H(X0, . . . , Xq−1) ≤ L ≤ H(X0, . . . , Xq−1) + 1 = Lmax

Because H(X0, . . . , Xq−1) ≤ H(X0) + · · · + H(Xq−1) (see [18] for proof of this),
this could be re-written as

Lmin ≤ H(X0) + · · · + H(Xq−1) ≤ L ≤ H(X0) + · · · + H(Xq−1) + 1 ≤ Lmax
Assuming identical probability distributions (Xi = X, ∀i), as when coding one
symbol at a time in the previous section, gives
Lmin = qH(X) ≤ L ≤ qH(X) + 1 = Lmax
which entails that one might construct a more efficient code than the one presented
in the previous section.
The practical difficulty in assigning one codeword for every possible message
should be apparent, as the number of possible messages is infinite and as translation
tables with infinite amounts of entries may not be stored in memories of
finite size. A method that requires only a little memory, and by which encoding
and decoding load is only linear in the length of the messages, exists in the form
of so-called arithmetic coding. Arithmetic coding has an average length bound by

Lmin = qH(X) ≤ L ≤ qH(X) + 2 = Lmax
giving an average rate, R, of

Rmin = H(X) ≤ R ≤ H(X) + 2/q = Rmax

which is very close to the previously presented limit. [18]
As noted in section 4.4.1, the entropy of a data source forms a lower bound
on the average data rate with which data emanating from this source may be
represented. As arithmetic coding comes within negligible distance of this limit,
the problem of coding data may be considered to be “solved” and compression of
data is reduced to a problem of deducing the probability distribution, P . This is
the prerequisite assumed when compressing data using PPM (see section 4.6.5.6)
or other modern methods.
4.5 Source Coding Using the Tiger XS
After interviewing developers at Sectra, bounds on the resource consumption were
established. These bounds are constituted of two maximum limits: a tenable
maximum, indicating the maximum amount of resources that may justifiably be
consumed, and a hardware maximum, indicating the maximum resources available
in terms of hardware. These bounds are presented in table 4.1.
Given these bounds, a strategy was formed as to which source coding algorithms
should be focused on and which algorithms and data structures may be suitable to
employ.
As it is assumed that the messages transmitted are but a few hundred characters
long, the resource availability in terms of instructions is abundant; several
tens of thousands of instructions per decoded character.
Source coding algorithms typically use a lot of memory to maintain the statistics
used to encode or decode data. Many, if not most, of the source coding algorithms
presented in the last ten to twenty years have memory consumptions that may
even be troublesome to fulfill on a low-end personal computer.
Resource              Tenable maximum^a   Hardware maximum^b
Instructions          10 million          50 million
RAM                   64 kbytes           256 kbytes
Non-writable memory   64 kbytes           256 kbytes
Program memory        64 kbytes           256 kbytes

^a The maximum amount of resources which may justifiably be consumed
^b The absolute maximum amount of resources available

Table 4.1. Maximum computing resources
It is therefore postulated that the memory will form the bottleneck when a
source coding system with preloaded statistics is implemented in the Tiger XS.
The need for preloaded statistics emanates from the static, non-adaptive approach
(see section 4.6.3).
The static approach entails that the statistics do not need to be kept in
writable memory; this means that none of the ordinary RAM needs to be occupied
by statistics.
The following two strategies shall prevail throughout this thesis:
- If a trade-off between processing time and memory requirements exists, memory need shall be minimized within reasonable limits.
- Source coding techniques with excessive memory requirements shall not be
considered.
4.6 Source Coding of Text
Textual data typically has a high degree of redundancy and can easily be compressed.
4.6.1 The Entropy of Text
As noted in section 4.4.1 and further examined in sections 4.4.5.3 and 4.4.5.4 the
entropy of a data source is an important property as it poses a lower bound on
the average compression one can expect to achieve.
There are different approaches as to how to estimate the entropy of textual data
sources. Some of those are:
Statistical letter models Treats the text as data sprung from a Markov source.
The model is assumed to have one state for each possible q-gram² and the
transitions are assumed to be the transitions induced by pushing the next
character into the q-gram. Equivalently, the probability of a state is assumed
to be the frequency of its associated q-gram, and the probabilities of the
transitions are assumed to be the probabilities of observing the associated
character following that q-gram. This is further explained in theorem 4.1,
and results of the method applied on actual text can be found in table 4.2.

² A sequence of q characters from an alphabet
Statistical word models Treats text as a sequence of words and attempts to
calculate the entropy based on a finite context of q preceding words using
a method identical to that of the letter model with the exception of words
being used instead of letters.
Guessing models Consists of tests carried out with the aid of human subjects,
knowing the preceding letters or words, attempting to guess the next letter
or word in a reference text. Entropy estimations are based on how many
tries the subject needed to correctly guess the next letter. See table 4.3 for
Shannon's original results [19].
Gambling models Consists of tests carried out with the aid of human subjects,
knowing the preceding letters or words, gambling on the next letter or word
in a reference text. Gambling models offer the subject the possibility to
bet a variable amount of money depending on their level of certainty. By
allowing the bet to vary in amount, the subject's level of confidence in their
predictions is captured, not just what they predict as most likely.
Results from gambling estimates presented by Cover and King display individual gambling results equivalent to between 1.29 and 1.90 bits/char and
collective gambling results between 1.25 and 1.34 bits/char [9]. Like Shannon, Cover and King used Jefferson the Virginian as source of English text.
Theorem 4.1 (Estimating Entropy in a Markov Model)
Let X⃗ denote a state defined by the q previously observed characters,
X⃗ = {xi−q, . . . , xi−1}.
The probability of a state, p(X⃗), is approximated as the frequency of the q-gram in
the text observed.
The entropy of the Markov model of order q is calculated as:

H(xi | xi−q, . . . , xi−1) = −Σ_{∀X⃗} p(X⃗) Σ_{∀a∈An} p(xi = a | X⃗) log2 p(xi = a | X⃗),   An = {a0, . . . , an−1}
Theorem 4.1 was implemented in software and used to evaluate the entropy in
the language model reference files. The result of this is found in table 4.2. Note
that the method requires a large set of statistics to give a good estimate. For
higher orders this requires a very large reference file; hence Ordo(2) is the highest
Markov source order included in the table (this is equivalent to studying trigrams
in the text).
4.6.2 Static and Adaptive Approaches
In source coding, coding of the data is adapted to the source of the data. Several
strategies as to how to achieve this exist:
Ref. File     Ordo(0)  Ordo(1)  Ordo(2)
DN-4          4.55     3.60     2.79
CaP           4.46     3.51     2.76
RodaRummet    4.61     3.53     2.81
Bibeln        4.54     3.46     2.63
Nils          4.53     3.40     2.61
Macbeth       4.80     3.50     2.59

Table 4.2. Entropy estimates - Theorem 4.1 applied on the language reference files.
Results in bits/character.
Model order   0     1     2    3    4    5    6    7
Upper bound   4.03  3.42  3.0  2.6  2.7  2.2  2.8  1.8
Lower bound   3.19  2.50  2.1  1.7  1.7  1.3  1.8  1.0

Model order   8    9    10   11   12   13   14   >=100
Upper bound   1.9  2.1  2.2  2.3  2.1  1.7  2.1  1.3
Lower bound   1.0  1.0  1.3  1.3  1.2  0.9  1.2  0.6

Table 4.3. Entropy estimates - Shannon's guessing model applied on Jefferson the Virginian
Static Coding Static coding assumes homogeneous data with the same probabilities
throughout the data. Static coding is sensitive to changes in statistics, as
it cannot adapt to such changes. When using static coding, statistics
have to be known to the receiver as well as the transmitter at the start of
the transmission.
Semi-adaptive Coding Semi-adaptive coding allows for changes in the statistics
by sending updates of the statistics as sidechannel data. The approach
therefore offers the advantage of being able to re-optimize itself if the
statistics change. The cost of sending statistics as sidechannel data could be
very high, depending on how detailed the statistics are.
Adaptive Coding Adaptive coding derives its statistics from previously observed
characters, i.e. characters that have already been transmitted. This gives a
poor coding at the start of transmission, but it gets increasingly better. Adaptive
approaches can also adapt to changes in statistics, all without the need
for transmitting any statistical sidechannel data.
Almost all source coding methods developed in recent years assume the adaptive
approach, as it forms an efficient general-purpose method of compressing data.
4.6.3 Early Design Decisions
The following assumptions have been made about the messages sent using the Tiger
XS:
1. Messages are composed of about 100–500 characters.
2. Messages are composed of text, most of which is written using a language
known prior to the production of the product.
It is unlikely that a semi-adaptive approach would yield any reasonable result,
as even a slim set of statistics would be bigger than the message itself.
That leaves us with a static approach or an adaptive approach. To investigate
whether an adaptive approach would be suitable, an early test was carried out.
A strictly adaptive PPM coder (see section 4.6.5.6) was tested on the reference
message “Jordbävning” (480 characters). The PPM coder used was “stomp”,
a coder developed by the author a year prior to this thesis. “stomp” achieves
compression performance in line with early predictive coders (good performance).
The outcome of this test was an average coding rate of 5.41 bits/char, a rate
considered to be unfavorable given the source. Indeed, it would later be shown that
even the most trivial static source coding methods presented in this thesis achieve
better performance than 5.41 bits/char.
The outcome of this early experiment strongly indicates that adaptive coding
is undesired in this situation.
Static coding requires the data to be homogeneous; this is, however, assumed in
assumption 2 listed above.
The unfavorable rates given by adaptive approaches, in combination with the
assumption that the data to be coded is homogeneous, caused this early postulate
to be adopted:
The coding shall be based on static statistics. These statistics
shall be assumed to be present at both parties at the start of
transmission.
4.6.4 Dictionary Techniques
Dictionary-based compression techniques are among the most intuitive of compression
techniques. Mapping one or more characters onto a single codeword of fixed
or variable size, dictionary compression offers compression by favoring frequently
used characters by assigning them shorter codewords, by assigning codewords to
frequently occurring combinations of characters, or by a combination of both.
To ensure that all possible combinations of characters can be encoded, the
dictionary must include all characters as single-character entries.
4.6.4.1 Fixed-length and Variable-length Dictionaries
Dictionaries can be divided in two groups:
Fixed-length codeword dictionaries Achieve compression by encoding a group
of characters at a time.
Variable-length codeword dictionaries Achieve compression by encoding a
single character or a group of characters with codewords of varying length.
4.6.4.2 Parsing of text for dictionary coding
The process of translating a text into dictionary pointers is generally referred to as
parsing. Given a dictionary and a text to be encoded there are generally several
different ways to parse the text, each of which yields output with potentially
different length.
A common technique is to use so-called greedy parsing. When using greedy
parsing, the text is parsed using the longest possible string found in the dictionary
that matches the next characters in the text. Greedy parsing is perhaps the most
intuitive parsing strategy and it is relatively easy to implement, as it only requires
a few characters at a time to be considered. Unfortunately, greedy parsing is not
necessarily optimal, though often good.
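Greedy parsing can be sketched as follows, assuming the dictionary is a set of strings containing at least all single characters:

```python
# Sketch of greedy parsing: at each position, the longest dictionary
# string matching the upcoming characters is chosen.

def greedy_parse(text, dictionary):
    """dictionary: set of strings, must contain all single characters."""
    out, i = [], 0
    max_len = max(len(d) for d in dictionary)
    while i < len(text):
        # try the longest possible match first, down to a single character
        for l in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + l] in dictionary:
                out.append(text[i:i + l])
                i += l
                break
    return out
```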
Optimal parsing of a text requires the whole text to be considered when parsing,
not just a few characters at a time. This is clearly a more complex parsing
strategy. One possible approach is to transform the problem into a shortest path
problem [3] and then solve it using existing algorithms for shortest path problems.
Given a message M = {b0, . . . , bn−1}, a graph with n + 1 nodes numbered from 0
to n is constructed, where n is the number of characters in the text. A pair of
nodes, i and j, is connected via a directed edge if and only if the string [bi, . . . , bj] is
present in the dictionary. Furthermore, the edge is given weight equal to the length
of the corresponding dictionary codeword. The shortest path between node 0 and
node n represents the optimal parsing.
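Because all edges in this graph point forward, the shortest path can be found with a simple dynamic program over the text positions. In this sketch, codeword_bits maps each dictionary entry to its codeword length; the names and the choice of weights are illustrative assumptions:

```python
# Sketch of optimal parsing as a shortest path problem: node i is the
# position before character i, edges follow dictionary matches, and edge
# weights are codeword lengths.

def optimal_parse(text, codeword_bits):
    n = len(text)
    INF = float("inf")
    cost = [0.0] + [INF] * n          # cost[i]: cheapest coding of text[:i]
    back = [None] * (n + 1)           # back[j]: (start, entry) of last step
    for i in range(n):
        if cost[i] == INF:
            continue
        for entry, bits in codeword_bits.items():
            if text.startswith(entry, i):
                j = i + len(entry)
                if cost[i] + bits < cost[j]:
                    cost[j] = cost[i] + bits
                    back[j] = (i, entry)
    parse, j = [], n                  # walk the back-pointers from node n
    while j > 0:
        i, entry = back[j]
        parse.append(entry)
        j = i
    return parse[::-1], cost[n]
```

Note that a greedy parser would pick "ab" first in the example below, at a higher total cost; the dynamic program finds the cheaper split.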
The complex and resource-demanding process of finding an optimal parsing can be
made less demanding by dividing the text into chunks of data with length
l, where l << n. These smaller segments of text are then encoded optimally.
An alternative to greedy parsing and optimal parsing is the so-called Longest
Fragment First (LFF) method. As indicated by its name, LFF works by attempting
to match the text with dictionary entries, checking the longest sequence first
and working its way to the shortest sequences. This method generally achieves
performance better than greedy parsing and worse than optimal parsing, and is
primarily effective when using fixed-length codes.
It is worth noting that utilizing advanced parsing strategies only imposes work
and complexity on the encoding party.
The source coding evaluation environment includes methods for greedy parsing
only, as it was believed that this method offered the best complexity/performance
trade-off.
4.6.4.3 Evaluated Dictionary Techniques
The design decisions presented in section 4.6.3 essentially reduce the problem to
finding a suitable dictionary given a language reference file. Furthermore, it means
that the work of finding a suitable dictionary can be carried out on a computer.
There is moreover no need for a fast way of deriving dictionaries, as it only needs
to be carried out once.
As a result, the dictionary methods described here are constructed to solve
the problem of finding an optimal or near-optimal dictionary given a language
reference file and some constraints on coder complexity.
A number of different techniques for composing dictionaries have been implemented
and evaluated. The following sections contain information about these
four standard methods for selecting dictionaries:
• Unigram Coding (section 4.6.4.4)
• Digram Coding (section 4.6.4.5)
• LZW Dictionaries (section 4.6.4.7)
• Wordbook Dictionaries (section 4.6.4.8)
Also, one not so common method (q-gram coding) and two methods developed
exclusively as a part of this thesis have been implemented and evaluated:
• Q-gram Coding (section 4.6.4.6)
• Length Differential Dictionaries (section 4.6.4.9)
• Entropy Differential Dictionaries (section 4.6.4.10)
When reporting the results of the evaluation of the dictionary techniques, the
prerequisites in terms of coding and dictionary size are reported in the form of a
profile. The profile has a prefix of either “FL” or “VL”, used to indicate whether
fixed-length coding or variable-length coding was used. The prefix is then followed
by a number indicating the size of the dictionary. Example: “VL-1024” denotes
a variable-length dictionary with 1024 different codewords.
4.6.4.4 Unigram Coding
The simplest of all dictionaries, unigram coding offers no compression unless used
with a variable-length coder. The dictionary is comprised of all single characters,
the frequencies of which are those found in the text.
Dataset        FL-256   VL-256
DN-4           8.00     4.59
Blair          8.00     4.69
Diplomatic-1   8.00     4.59
Jalalabad      8.00     4.69
Jordbävning    8.00     4.62
Nelson         8.00     4.80

Table 4.4. Unigram coding performance, measured in bits/char
4.6.4.5 Digram Coding
Digram coding is a simple form of dictionary coding that draws its strength from
representing the most commonly used digrams (pairs of characters) using single
codewords.
Performance of digram dictionaries is displayed in figure 4.3.
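Constructing a digram dictionary from a reference text can be sketched as follows; the 256 single-byte characters are always included so that any text remains encodable:

```python
# Sketch of digram dictionary construction: all 256 single characters are
# included, and the remaining entries are the most frequent character
# pairs found in a reference text.
from collections import Counter

def digram_dictionary(reference, size):
    singles = [chr(i) for i in range(256)]
    pairs = Counter(reference[i:i + 2] for i in range(len(reference) - 1))
    top_pairs = [p for p, _ in pairs.most_common(size - len(singles))]
    return singles + top_pairs
```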
Figure 4.3. Digram coding performance. *Only 1415 unique digrams were found,
yielding a dictionary with 1415 + 256 = 1671 entries
4.6.4.6 Q-gram Coding
Q-gram coding improves upon digram coding by representing an arbitrary number,
q, of characters using a single codeword. The problem of finding the optimal
q-grams to be included in the dictionary is known to be NP-complete in the size of
the text [3]. Heuristic methods for finding near-optimal q-gram dictionaries do
exist; the algorithm presented here is one such method.
The algorithm attempts to find a set of M different equifrequent q-grams of
variable size. The process is complicated by the fact that including a q-gram in
the dictionary will reduce the frequency of those shorter q-grams of which it is
composed. For each included q-gram, frequencies must be updated and q-grams
whose frequencies fall below a threshold must be removed from the dictionary.
As the method strives to build a dictionary with equifrequent entries, it is well
suited for fixed-length coding.
Performance of this method is displayed in figure 4.4.
Algorithm 1 q-gram selection algorithm
1: Define p(ci) as the frequency of the q-gram ci in the text.
2: Define Cq = {c(q,0), c(q,1), ...} as all combinations of exactly q characters in the
   reference text but not in the dictionary.
3: Define max(Cq) as the q-gram with the highest frequency in Cq.
4: q = 2
5: while Number Of Codewords < Max Number Of Codewords do
6:   if p(max(Cq+1)) ≤ p(max(Cq)) then
7:     Include max(Cq) in the dictionary.
8:     Adapt frequencies for all q-grams.
9:     Remove those q-grams in the dictionary with frequency below p(max(Cq)).
10:    If removal causes the frequency of a certain q-gram to jump above
       p(max(Cq)), set q = q(that q-gram).
11:  else
12:    q = q + 1
13:  end if
14: end while
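A much simplified sketch of the selection idea: starting from the single characters of the reference, the most frequent q-grams not yet in the dictionary are added. The frequency re-adaptation and eviction steps of the algorithm above are deliberately omitted, so this is an approximation of the method, not a faithful implementation:

```python
# Much simplified q-gram selection: greedily add the most frequent
# q-grams. The frequency adaptation and eviction steps of the full
# algorithm are deliberately omitted.
from collections import Counter

def select_qgrams(reference, max_entries, max_q=4):
    dictionary = set(reference)          # single characters of the reference
    counts = Counter(
        reference[i:i + q]
        for q in range(2, max_q + 1)
        for i in range(len(reference) - q + 1)
    )
    for gram, _ in counts.most_common():
        if len(dictionary) >= max_entries:
            break
        dictionary.add(gram)
    return dictionary
```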
4.6.4.7 LZW Dictionaries
LZW is a streamlined version of the LZ78 algorithm originally proposed by Ziv and
Lempel [25] in 1978. It improves upon LZ78 by getting rid of the single character
included in every output word.
LZW is an adaptive algorithm and thus adapts its dictionary as new characters
are encoded. A static version of LZW can be made by simply feeding the encoder
with characters from the reference text and extracting the dictionary when it has
reached the desired size.
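Extracting such a static dictionary can be sketched by running the LZW update rule over the reference text and stopping at the desired size:

```python
# Sketch of static dictionary extraction with the LZW update rule: run
# the encoder over the reference text and stop at the desired size.

def lzw_dictionary(reference, size):
    dictionary = {chr(i) for i in range(256)}   # all single characters
    current = ""
    for ch in reference:
        if current + ch in dictionary:
            current += ch                       # extend the current match
        else:
            if len(dictionary) < size:
                dictionary.add(current + ch)    # LZW update rule
            current = ch
        if len(dictionary) >= size:
            break
    return dictionary
```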
Figure 4.4. Q-gram coding performance
LZW adapts quickly, as one new dictionary entry is created for every character
observed in the reference. LZW dictionaries therefore adapt to only a small portion
of the text and are incapable of utilizing more than a small portion of the
statistics available in the reference text.
Performance of the LZW dictionary is displayed in figure 4.5.
4.6.4.8 Wordbook Dictionaries
A simple yet effective way to select entries in the dictionary is to use words as
entries. Given a reference text one could easily extract the words present in the
text and include the most common words in the dictionary. The performance of
the wordbook dictionary is displayed in figure 4.6.
Wordbook dictionaries are very sensitive to changes of the language used in
the message.
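Selecting wordbook entries can be sketched with a simple word frequency count; in a full implementation the single characters would be included alongside the words so that any text remains encodable:

```python
# Sketch of wordbook selection: split a reference text into words and
# keep the most common ones as dictionary entries.
from collections import Counter

def wordbook(reference, n_words):
    words = Counter(reference.split())
    return [w for w, _ in words.most_common(n_words)]
```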
4.6.4.9 Length Differential Dictionaries
The length differential method was developed as a part of this thesis. The basic
idea of the method is to determine the gain in terms of changes in the length of
the encoded output. The change is estimated for the different q-grams that may
be included in the dictionary and those estimated to cause the most reduction in
size of the output are included. This is done as an iterative process wherein entries
may also be removed from the dictionary if deemed to be ineffective.
The actual gain of including a q-gram in the dictionary, in terms of length
of the encoded data, does not only depend on the frequency of the q-gram itself
but also on the bits required to encode its characters prior to its inclusion in the
dictionary. This alternative encoding, using only q-grams already present in the
dictionary, will consist of two or more q-grams of known frequency. The cost of
encoding the q-gram not present in the dictionary can therefore be computed and
compared to the cost of including another q-gram in the dictionary.
The algorithm presented here considers suitable codeword lengths when selecting
q-grams to be included in the dictionary and is therefore adapted for variable-length
coding; if variable-length coding is not used, q-gram coding as described in
section 4.6.4.6 will be better suited.
Given the following denotation:
• Let p(ci) be the frequency of the q-gram ci.
• Let D(ci) be all the q-grams in the alternative coding of ci.
• Let g(ci) be the gain, expressed in bits/char, of including the q-gram ci in the
dictionary.
• The optimal codeword length of ci is −log2 p(ci), see section 4.4.5.3.
The relative gain will be:

g(ci) = p(ci) ( log2 p(ci) − Σ_{∀c∈D(ci)} log2 p(c) )
Figure 4.5. LZW-dictionaries coding performance
Figure 4.6. Wordbook coding performance
An algorithm selecting q-grams based on this relative gain is supplied in algorithm 2.
Appropriate values of the two constants InitialSearchCoeff, determining
how many dictionary entries to initially include for each q, and SearchCoeffIncrease,
determining how many entries will be added for every increase in q, were found
empirically to be −100 and 5, respectively. Those are the values used when measuring
performance.
The performance achieved is reported in figure 4.7. Note that while the performance
for fixed-length codes is reported as with the other methods, this method is
developed only to be used as a variable-length code, and hence the result is poor
for fixed-length codes.
Algorithm 2 Length differential selection algorithm
1: Define p(ci) as the frequency of the q-gram ci in the text.
2: Define n(ci) as the number of occurrences of the q-gram ci in the text.
3: Define g(ci) as the gain expressed in bits/char of including the q-gram ci in the dictionary.
4: Define Cq = {c(q,0), c(q,1), ...} as all combinations of exactly q characters in the reference text but not in the dictionary.
5: q = 2
6: SearchThreshold = InitialSearchCoeff
7: while Number Of Codewords < Max Number Of Codewords do
8:   Compute the total gain, G(ci) = n(ci) · g(ci), for ∀ci ∈ Cq
9:   Find Cbest, the set of ci with the highest total gain, Gmax = G(ci)
10:  if Gmax ≥ SearchThreshold then
11:    Include Cbest in the dictionary
12:    Update frequencies of all dictionary entries
13:    Evict any dictionary entry with frequency zero
14:  end if
15:  if q ≥ MaxQ then
16:    SearchThreshold = SearchThreshold + SearchCoeffIncrease
17:    if SearchThreshold ≥ 0 then
18:      break
19:    end if
20:    q = 2
21:  else
22:    q = q + 1
23:  end if
24: end while
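As an illustration, the gain formula and the greedy selection loop can be sketched in Python. This is a simplified sketch, not the thesis implementation: the alternative coding is approximated by the q-gram's individual characters, existing entries are not re-weighted or evicted, and all names are hypothetical.

```python
import math
from collections import Counter

def char_probs(text):
    counts = Counter(text)
    return {ch: n / len(text) for ch, n in counts.items()}

def qgram_gain(text, qgram):
    """g(ci) from the formula above, with the alternative coding
    approximated by the q-gram's single characters."""
    p = char_probs(text)
    occurrences = text.count(qgram)
    if occurrences == 0:
        return 0.0
    p_q = occurrences * len(qgram) / len(text)          # frequency of the q-gram
    alt_bits = sum(-math.log2(p[ch]) for ch in qgram)   # cost without the entry
    new_bits = -math.log2(p_q)                          # cost with the entry
    return p_q * (alt_bits - new_bits)

def select_qgrams(text, max_codewords, max_q=4,
                  initial_search_coeff=-100.0, search_coeff_increase=5.0):
    """Greedy selection loop in the spirit of Algorithm 2 (simplified:
    no frequency updates or eviction of existing entries)."""
    dictionary = set(text)   # single characters are always available
    threshold = initial_search_coeff
    q = 2
    while len(dictionary) < max_codewords:
        candidates = {text[i:i + q] for i in range(len(text) - q + 1)} - dictionary
        if candidates:
            best = max(candidates,
                       key=lambda c: text.count(c) * qgram_gain(text, c))
            if text.count(best) * qgram_gain(text, best) >= threshold:
                dictionary.add(best)
        if q >= max_q:
            threshold += search_coeff_increase
            if threshold >= 0:
                break
            q = 2
        else:
            q += 1
    return dictionary
```

The loop terminates either when the dictionary is full or when the rising threshold reaches zero, mirroring the structure of the algorithm above.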
Figure 4.7. Length differential coding performance

4.6.4.10 Entropy Guided Dictionaries

This method was developed as a part of this thesis. Like the Length Differential algorithm, this algorithm considers the total length of the compressed data and how that length would change as a result of the inclusion of a new dictionary entry.
The length of the compressed data is given by Ltot = ntot L, where ntot is the total number of coded symbols and L is the entropy of the symbol alphabet, S. L is given by

L = − Σ∀ci∈S p(ci) log2 p(ci)
The introduction of a new codeword, cx, would affect Ltot by giving a change in ntot and L.

ntot would be changed by a factor given by

nf = 1 − p(cx) ( ( Σ∀ci∈cx 1 ) − 1 )

where ∀ci ∈ cx are the codewords in the alternative coding of the larger codeword cx (like coding “b” and “ar” might be the alternative to “bar”).
Ltot would be increased by the introduction of the new codeword

ΔL+ = −p(cx) log2 p(cx)

and further affected by the change of statistics on the remaining codewords. We approximate this change as a change affecting³ the codewords ∀ci ∈ cx:

ΔL− = − Σ∀ci∈cx (p(ci) − p(cx)) log2 (p(ci) − p(cx)) − ( − Σ∀ci∈cx p(ci) log2 p(ci) )
    = − Σ∀ci∈cx [ p(ci) log2 ( (p(ci) − p(cx)) / p(ci) ) − p(cx) log2 (p(ci) − p(cx)) ]

giving a total change of

ΔL = ΔL+ + ΔL−

This gives an approximate change of data length given by

ntot nf (L + ΔL) − ntot L = ntot ( nf (L + ΔL) − L )

that is, a relative change given by

nf (L + ΔL) − L
This measure is used to decide which q-grams are included in the dictionary.
The implementation of this is similar to the implementation of Length Differential
Dictionaries described in the previous section.
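A sketch of the relative-change measure in Python, assuming the alternative coding of the candidate codeword and the current symbol probabilities are given (all names hypothetical):

```python
import math

def relative_change(p, cx_parts, p_x):
    """nf * (L + dL) - L: approximate relative change in encoded length
    from adding a codeword c_x with probability p_x, whose alternative
    coding consists of the existing codewords in cx_parts."""
    # L: entropy of the current symbol alphabet
    L = -sum(pi * math.log2(pi) for pi in p.values())
    # nf: factor by which the number of coded symbols changes
    n_f = 1 - p_x * (len(cx_parts) - 1)
    # dL+: contribution of the new codeword itself
    dL_plus = -p_x * math.log2(p_x)
    # dL-: change from the reduced probabilities of the old codewords
    dL_minus = -sum((p[c] - p_x) * math.log2(p[c] - p_x)
                    - p[c] * math.log2(p[c]) for c in cx_parts)
    dL = dL_plus + dL_minus
    return n_f * (L + dL) - L
```

A negative value means the candidate codeword is expected to shrink the encoded data.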
In practice, it turned out that the algorithm was difficult to implement in a manner that gave it good performance. It was implemented in two stages; the first being a simplified implementation that approximated the relative change as nf ΔL and the second implementing the algorithm in full.

Surprisingly, the first implementation achieved much better performance, as the second one consistently chose extremely long text strings as entries. The performance measurements supplied in the form of figure 4.8 are therefore based on the simplified version of this algorithm.

³It should be noted that the exact alternative coding cannot be established as it is dependent on the context (as well as parsing method).
4.6.4.11 Implementing Dictionary Coding in the Tiger XS
As the Tiger XS implementation of dictionary coding is of interest, some aspects
of such an implementation will be discussed here.
Assuming that a dictionary with n entries has been generated using some
method, described here or not, the problem of utilizing dictionary coding is reduced
to parsing text and looking up dictionary entries.
It will be assumed that greedy parsing is employed.
When encoding, the problem is finding the longest matching dictionary entry. Using a simple sorted list to organize the entries would do, but search times may be long and performance may be aided by providing data structures for searching. If entries are sorted and accessible by their index the longest matching entry could be found using a binary search in at most ⌈log2 n⌉ accesses. A digital search tree organized so that branches correspond to characters could make searches very fast but at a high cost in terms of memory.
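For illustration, greedy parsing with longest-match lookup can be sketched as follows. This uses a hash set rather than the sorted list or digital search tree discussed above, and the names are hypothetical:

```python
def greedy_parse(text, entries):
    """Greedy parsing: at each position emit the longest matching
    dictionary entry (single characters assumed always present)."""
    entries = set(entries)
    max_len = max(len(e) for e in entries)
    out, i = [], 0
    while i < len(text):
        # try the longest possible match first
        for length in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + length] in entries:
                out.append(text[i:i + length])
                i += length
                break
        else:
            raise ValueError("character %r not in dictionary" % text[i])
    return out
```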
When decoding variable-length codes, a binary tree is a suitable data structure. The movements when traversing the tree would of course correspond to the bits of the encoded message. The chance of encountering the symbol s would be p(s) and the corresponding codeword length would be ⌈−log2 p(s)⌉, giving the expected number of branches needed to be taken before the codeword is found:

Σ∀s∈S p(s) ⌈−log2 p(s)⌉

This is also the coding rate of the symbol coder measured in bits/symbol. Note that if a symbol corresponds to several characters the rate in bits/symbol is not equivalent to bits/character.
If fixed-length codes are used, a simple array of strings would provide a simple and effective data structure for decoding.
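The binary decoding tree can be sketched as nested dictionaries keyed on bits; a minimal illustration with a hypothetical prefix-free codebook:

```python
def build_decode_tree(codebook):
    """Build a binary decoding tree from {symbol: bitstring}."""
    root = {}
    for symbol, bits in codebook.items():
        node = root
        for bit in bits[:-1]:
            node = node.setdefault(bit, {})
        node[bits[-1]] = symbol   # leaf holds the decoded symbol

    return root

def decode(bits, tree):
    """Walk the tree bit by bit, emitting a symbol at each leaf."""
    out, node = [], tree
    for bit in bits:
        node = node[bit]
        if not isinstance(node, dict):
            out.append(node)
            node = tree
    return out
```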
4.6.5 Predictive Techniques
Predictive techniques achieve compression by making predictions of what character
is to be seen next and coding accordingly. Each character is given a probability
based on a model of the data source. The character being encoded is then encoded
using some variable-length coding scheme, preferably arithmetic coding.
Although the predictive techniques presented here are described in a text compression context, almost all results here are applicable in situations where data of arbitrary type is to be encoded⁴.

⁴Given that the language reference file is tuned to this data and that there is some redundancy in the data.
Figure 4.8. Entropy guided coding performance

4.6.5.1 Context Prediction
In context prediction the base of the predictions, the model, is the context in which
the new character appears.
An important concept in context modeling is the order of the context. A context of order n, where n ≥ 0, is the context defined by the preceding n characters. Additionally, a context of order 0 and of order −1 is defined. The order 0 context derives probabilities from the observed frequencies of characters regardless of their context. The order −1 context is the equiprobable context wherein all possible characters are considered equally probable.
Probabilities predicted by a context predictor are the frequencies with which characters have occurred in that specific context before.
Models of the contexts and their frequencies are commonly derived from a reference text. Contexts used by the predictors are those found in the reference text. In most applications the model is based on previously observed characters, making the method adaptive. Context modeling can also be used as a static method by letting the model be derived from a static text reference known by both the encoding party and the decoding party. In addition to purely adaptive and purely static models, a combination of the two can be used, combining the predictions of the two models.
4.6.5.2 Blending
Instead of relying on context prediction of a single order, blending models blend predictions from context predictions of different orders. When combining different models a relative weight is given to each of the different context orders and their predictions are re-weighted according to the relative weight.
Given models of order −1 to m and probabilities po(ai), where o is the order of the model, model weights wo are assigned to each model. The blended probabilities p(ai) are then given by

p(ai) = Σk=−1..m wk pk(ai)

The model weights, wo, should be normalized so that

Σk=−1..m wk = 1

in order to assert that the final probabilities sum to 1.
Blending may also be used to mix probabilities from completely different types
of predictive techniques, not only simple, character oriented context models. When
referring to blending in this section, however, it will only be used to signify blending
of simple context oriented models.
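A minimal sketch of the blending formula, with each model given as a mapping from characters to probabilities (names hypothetical):

```python
def blend(models, weights):
    """p(a_i) = sum_k w_k * p_k(a_i), with the weights normalized so
    that they sum to 1. models[k] maps characters to p_k(a_i)."""
    total = sum(weights)
    alphabet = set().union(*(m.keys() for m in models))
    return {a: sum((w / total) * m.get(a, 0.0)
                   for w, m in zip(weights, models))
            for a in alphabet}
```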
4.6.5.3 Blending Using Escape Symbols
Computing blended probabilities for a large set of models with varying order is a quite complex task. In addition to the cost of calculating the blended probabilities, coding a character using non-zero probabilities for all characters found in the alphabet is typically costly for the coder.
An intuitive simplification of the previously presented blending scheme exists. It is built on introducing a new symbol, the so-called escape symbol. The escape symbol signifies that a lower order model should be used to predict the next symbol rather than the current model. By starting with the highest context order deemed significant one could then predict using that single model. If the character being encoded has not yet been seen in that context, an escape symbol is encoded and the next lower order is used to predict the character. This scheme guarantees that all characters in the alphabet can be encoded, as we can always use escape symbols to escape down to order −1, where all characters are guaranteed to be predicted with a probability greater than 0.
A practical approach starts at an initial context which has been seen sufficiently many times to be considered statistically relevant. If the character being encoded has been seen in that context, the character is coded accordingly. If the character has not yet been seen in that context, an escape symbol is encoded and a lower order model is used. This process is repeated until a model in which the character has been seen is reached and the character is encoded. The coding of an escape symbol is thus equivalent to the next character not having been seen in the context before.
While being less costly than full blending, the escape symbol approach might seem different from the blending presented in section 4.6.5.2; this is, however, not the case. Given the escape symbol probability, eo, the probability of using a specific order, o, to predict the character can be expressed as

wo = (1 − eo) Πi=o+1..m ei,   −1 ≤ o < m
wm = (1 − em)

and is thus equivalent to the model weight, wo, as used in section 4.6.5.2. The model weights will all be confined to the interval [0, 1] provided that the escape symbol probabilities are confined to [0, 1]. In this aspect the full blending method and the escape symbol blending method are equivalent.
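The equivalence can be checked numerically: given escape probabilities per order (with e = 0 for order −1, which always predicts), the implied model weights sum to 1. A small sketch with hypothetical values:

```python
def model_weights(escape_probs):
    """escape_probs[k] is e_o for orders o = -1..m (index 0 is order -1,
    which never escapes, so escape_probs[0] should be 0).
    w_o = (1 - e_o) * product of e_i over all higher orders i > o."""
    weights = []
    for k, e in enumerate(escape_probs):
        w = 1 - e
        for higher in escape_probs[k + 1:]:
            w *= higher
        weights.append(w)
    return weights
```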
4.6.5.4 Escape Symbol Probabilities
It has been concluded in section 4.6.5.3 that the probability of an escape symbol
is the probability of the occurrence within the context of a previously unseen
character.
The problem of finding a proper method for assigning escape symbol weights is generally referred to as the zero frequency problem. As noted by Bell[3] and Witten[23], among others, there is no theoretical basis for choosing escape symbol weights optimally. Several different methods for selecting escape symbol weights exist, however.
Let the following describe the prerequisites for a method selecting escape symbol weights: Let C be a context and A = {a0, ..., aq} be the alphabet used. Let c(ai) be the number of times the character ai has occurred in the context C. Furthermore, let the total number of times the context C has occurred be denoted C0 and let the escape symbol be denoted eC.
The following awkwardly named methods have been used to derive escape
probabilities:
Method A is the first of the two methods in the paper introducing PPM[8], the first practical and effective blending compression algorithm. The method assigns the escape symbol the weight 1, giving the escape symbol the weight and probability

w(eC) = 1
p(eC) = 1 / (C0 + 1)

and giving previously observed characters the weights and probabilities

w(ai) = c(ai)
p(ai) = c(ai) / (C0 + 1)
Method B is the second of the two methods in the paper introducing PPM[8]. The method stems from the idea that a character in a particular context is considered novel if it has not yet been observed in that context at least twice. The novel characters, N, are then given by

N = {ai ∈ A : c(ai) ≤ 1}

and the repeated characters, R, are given by

R = {ai ∈ A : c(ai) > 1}

Escape symbol weights and probabilities increase with the number of repeated characters, |R|:

w(eC) = |R|
p(eC) = |R| / C0

and previously observed characters are given the weights and probabilities

w(ai) = c(ai) − 1
p(ai) = (c(ai) − 1) / C0
Method C is very similar to method B but considers all characters that have been seen in the context as repeated characters; that is

R = {ai ∈ A : c(ai) > 0}

giving the escape symbol weights and probabilities

w(eC) = |R|
p(eC) = |R| / (C0 + |R|)

and previously observed characters the weights and probabilities

w(ai) = c(ai)
p(ai) = c(ai) / (C0 + |R|)
Method D is a modified version of method C. The principle introduced in this method is that for every time the context is seen, the weight of all characters as well as the escape symbol weight shall, in total, increase by one. The introduction of a novel character therefore increases its weight by only 1/2 instead of one and increases the escape symbol weight by only 1/2 as well. Given the following partition

R = {ai ∈ A : c(ai) > 0}

escape symbol weights and probabilities are given by

w(eC) = (1/2)|R|
p(eC) = (1/2)|R| / C0     (4.1)

and previously observed characters the weights and probabilities

w(ai) = c(ai) − 1/2 if c(ai) > 0, 0 else
p(ai) = (c(ai) − 1/2) / C0 if c(ai) > 0, 0 else
Method P was introduced in 1991 by Witten and Bell[23]. It is based on a Poisson process model, where the appearance of each token is considered a separate Poisson process. It is not assumed that the different Poisson processes are independent. Please see [23] for a full review of the calculations involved.

Let tn be the number of characters that have appeared exactly n times in the context. The escape symbol weights and probabilities are then estimated as

w(eC) = t1 − t2/C0 + t3/C0² − ...
p(eC) = t1/C0 − t2/C0² + t3/C0³ − ...
giving previously observed characters the weights and probabilities

w(ai) = c(ai) (1 − (t1/C0 − t2/C0² + t3/C0³ − ...))
p(ai) = (c(ai)/C0) (1 − (t1/C0 − t2/C0² + t3/C0³ − ...))
These expressions give probabilities that are not necessarily confined to the
interval ]0, 1[. This will break the coder. Situations in which problems arise are
guaranteed to occur as the first occurrence of the context gives tn = 0, ∀n ∈ N
which in turn gives p(eC ) = 0 and p(ai ) = 0 with the result that no character can
be encoded. Care must be taken to ensure that the results of the given formula
are modified so that the probabilities are confined to the interval ]0, 1[.
Method X is a more practical version of method P where probabilities are estimated using only the first term. This gives escape symbol weights and probabilities accordingly:

w(eC) = t1
p(eC) = t1 / C0

giving previously observed characters the weights and probabilities

w(ai) = c(ai) (1 − t1/C0)
p(ai) = (c(ai)/C0) (1 − t1/C0)
For large n the results are often very close to those given by method P.
As with Method P, one must ensure that the probabilities used by the coder
are on the interval ]0, 1[.
Method XC combines method X with method C. Method X is used when possible, and if the estimated probabilities are not confined to the interval ]0, 1[, method C is used as a fallback. Escape symbol weights and probabilities are given by

w(eC) = t1 if 0 < t1 < C0, |R| else
p(eC) = t1/C0 if 0 < t1 < C0, |R|/(C0 + |R|) else

giving previously observed characters the weights and probabilities

w(ai) = c(ai)(1 − t1/C0) if 0 < t1 < C0, c(ai) else
p(ai) = (c(ai)/C0)(1 − t1/C0) if 0 < t1 < C0, c(ai)/(C0 + |R|) else
An empirical comparison of the different methods can be found in section 4.6.5.9.
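For illustration, the escape probabilities of several of the methods above can be sketched compactly (methods B, D and P omitted for brevity; counts holds the per-context character counts c(ai)):

```python
def escape_prob(counts, method):
    """p(e_C) for escape methods A, C, X and XC, from a context's
    character counts. C0 = total count, R = characters seen,
    t1 = characters seen exactly once."""
    C0 = sum(counts.values())
    R = sum(1 for n in counts.values() if n > 0)
    t1 = sum(1 for n in counts.values() if n == 1)
    if method == 'A':
        return 1 / (C0 + 1)
    if method == 'C':
        return R / (C0 + R)
    if method == 'X':
        return t1 / C0
    if method == 'XC':                 # X, with C as fallback
        return t1 / C0 if 0 < t1 < C0 else R / (C0 + R)
    raise ValueError(method)
```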
4.6.5.5 Exclusion
An important consequence of regarding the escape symbol as a sign of a new, previously unseen, character is that it implies that the character being encoded is not one of those that have been seen previously. The probabilities of those characters as predicted by the lower order context model can therefore be set to zero. This process is referred to as exclusion.
This process not only ensures that bits are not wasted on indicating that the
encoded character is not a specific character more than once, but it also helps in
minimizing the set of characters predicted and therefore lowers the load on the
coder.
A method referred to as lazy exclusion is employed by some commercial predictive coders. Lazy exclusion is in fact just a term used to signify that exclusion is not used; instead, predictions are made based on the current context only and failures of higher order contexts to predict characters are disregarded.
Exclusion in the form of so-called update exclusion can also be used when maintaining the statistics used for making predictions. A character occurring in a language reference file exists in many contexts (”y” in ”arbitrary” occurs in the zero-length context, the context ”r”, ”ar”, ”rar” and so forth). Updating all contexts in which the character occurs may be the most intuitive way, but updating only the context in which the character was successfully predicted could, according to [3], offer an increase in compression by about 2%.
A special case that sometimes occurs when exclusion is used is the situation wherein

|Excluded| + |Predicted| = |All symbols|

that is, when all symbols have either been excluded or are being predicted at the current context. All escape methods described in this paper would predict an escape probability greater than zero in this situation although the escape probability is obviously zero. The probability of this condition occurring is very low if a large alphabet is used and the effect may be minimal. Any of the escape probability estimation methods described in this paper could be modified to handle this specific condition, yielding a slight but most likely negligible improvement in compression.
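Exclusion itself amounts to zeroing the excluded characters and renormalizing the rest, as in this minimal sketch (names hypothetical):

```python
def exclude(probs, excluded):
    """Set the probabilities of already-tried characters to zero
    and renormalize the remaining predictions."""
    kept = {a: p for a, p in probs.items() if a not in excluded}
    total = sum(kept.values())
    return {a: p / total for a, p in kept.items()}
```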
4.6.5.6 Prediction by Partial Match (PPM)
PPM is a highly effective, context-based text compression algorithm published in
1984 by Cleary and Witten[8]. It blends models of various orders by using escape
symbols to switch to a lower order.
Relative weights of the predictions are given by the number of times the character in question has been observed in that context.

A typical configuration of PPM features escape symbol weights according to method C, arithmetic coding, a maximum context of 8 characters, and a threshold of 4. Such a setup typically achieves bit rates near the entropy of the source. The performance of this setup was measured on the performance reference files as well as the language reference files and is included in table 4.5. The number of nodes relates to the memory requirements (see section 4.6.5.15).
Dataset         PPM-1    PPM-2    PPM-3
DN-4            2.70     2.14     1.01
Blair           4.72     4.73     4.92
Diplomatic-1    3.04     2.85     2.95
Jalalabad       3.43     3.25     3.38
Jordbävning     3.19     2.92     3.02
Nelson          5.29     5.07     5.14
Avg. Swe        3.22     3.04     3.12
Avg. All        3.93     3.79     3.88
Inner Nodes     1 189    10 884   945 663
Total Nodes     13 205   54 835   1 048 504

Table 4.5. Performance and memory requirements of a typical PPM compression setup (measured in bits/character)
As with dictionary coding, a static approach is taken rather than an adaptive one (see section 4.6.3). Results presented here are based on prediction from a static set of statistics derived from the language reference file. It is assumed that this information is present at both the receiver and the transmitter. It is therefore important that the amount of memory required to store these statistics is kept reasonably low.
4.6.5.7 PPM Coding
PPM was originally constructed to be used with arithmetic coding but it may just as well be used with Huffman coding. Huffman codes do not have the ability to encode characters with less than one bit and thus cannot efficiently use the very skewed probabilities found in some text, which are often efficiently captured by PPM predictions. A comparison between the two coding methods when used with PPM can be found in table 4.6.

It is evident that for messages in a language differing from the language reference file, Huffman coding is favorable as it generally lessens the cost of coding sequences that do not occur in the language reference file. Arithmetic coding also has a slight overhead since the end of the message must be explicitly indicated. On very short messages, such as Nelson, this effect is apparent. As predictions vary from symbol to symbol when using PPM as a predictor, the Huffman codes need to be rebuilt for every new symbol encoded or decoded; this makes Huffman codes unsuitable for use with PPM.
Dataset         Arithmetic   Huffman
DN-4            1.83         2.07
Blair           4.77         4.67
Diplomatic-1    2.81         2.92
Jalalabad       3.31         3.26
Jordbävning     2.89         2.96
Nelson          5.14         4.14
Avg. Swe        3.00         3.05
Avg. All        3.78         3.59
Inner Nodes     32 577       32 577
Total Nodes     102 812      102 812

Table 4.6. PPM performance depending on coding (measured in bits/character). Max Context: 16, Occurrence Threshold: 4.

4.6.5.8 PPM Maximum Context Order

The choice of maximum context order to be used when predicting is of significance. PPM schemes are often classified as being one of two types:

Bounded context length (finite-context) Which has a maximum context length (usually around 8). This simplifies implementation and speeds up execution. Tree structures are typically used to efficiently store context statistics, see section 4.6.5.15 for further implementation details.

Unbounded context length (infinite-context) Which has no maximum context length. When using unbounded context length a deterministic context (see below) can always be found if such a context exists. In practice, implementing unbounded context length prediction requires inclusion of the whole language reference file. Tree search structures in combination with pointers into the parsed data are typically used to store context statistics, see section 4.6.5.15 for further implementation details. Unbounded context lengths were originally suggested in [7].
Studies indicate that for standard methods of deriving escape probabilities and initial orders there is rarely any gain in using context models of order higher than 6[7]. This is however very dependent on how the initial order is set and particularly on whether a so-called deterministic context exists.

A deterministic context is a previously observed context which is always followed by the same character. Empirical studies [21] indicate that the probability of successfully predicting the character being encoded when predicting from a deterministic context is very high, much higher than in other contexts. Tests with scaling probabilities in deterministic contexts surprisingly turned out to reduce performance.
Setting a lower maximum context length may substantially decrease memory
usage by the algorithm. It also potentially decreases the execution time as less
context searching is needed.
A comparison of compression performance for different context bounds can be
found in table 4.7.
Dataset         M.Cont.: 4   M.Cont.: 8   M.Cont.: 16   M.Cont.: 32
DN-4            2.36         1.84         1.83          1.83
Blair           4.72         4.77         4.77          4.77
Diplomatic-1    2.92         2.80         2.81          2.81
Jalalabad       3.30         3.32         3.31          3.31
Jordbävning     3.03         2.88         2.89          2.89
Nelson          5.24         5.17         5.14          5.17
Avg. Swe        3.08         3.00         3.00          3.00
Avg. All        3.84         3.79         3.78          3.79
Inner Nodes     3 998        25 703       32 577        32 824
Total Nodes     22 869       87 583       102 812       103 250

Table 4.7. Performance and memory requirements depending on maximum context length (measured in bits/character). Occurrence Threshold: 4.
4.6.5.9 PPM Escape Methods
Selecting an appropriate weight for the escape character is, as previously mentioned, a problem with no obvious solutions. In the original 1984 paper two different methods are used: Method A and Method B. An additional five methods, C, D, P, X and XC, are described in section 4.6.5.4.

Of the four non-Poisson oriented methods (A, B, C and D), C typically performs substantially better than A and B while D offers slightly better performance than C, although C is more commonly used.

Of the Poisson oriented methods (P, X and the hybrid method XC), method P, although more complex than method X, offers no substantial improvement in compression performance. Method XC, combining method X with method C, generally offers better performance than method X and typically performs slightly better than method C alone.
A comparison between four of the different escape symbol weighting methods
can be found in table 4.8.
4.6.5.10 Occurrence Thresholds
Contexts that are only found once or a few times in the language reference file may
not be statistically relevant. One could allow contexts whose occurrence count is
below a threshold to be disregarded. This will typically result in predictions from
a context of lower order.
Occurrence thresholds set high may lessen the chances of finding a deterministic context (described in section 4.6.5.8), which may affect the compression performance negatively. Separate occurrence thresholds for deterministic and non-deterministic contexts could be used to mitigate this.

If using a forward tree data structure derived from a language reference file (see section 4.6.5.15), the number of nodes can be reduced drastically by using an occurrence threshold; this reduces the memory requirements for storing statistics significantly. Regardless of how statistics for the contexts are stored, reducing the number of recorded contexts will lessen the memory required to store them.

Dataset         Escape: A   Escape: B   Escape: C   Escape: D
DN-4            1.91        2.02        2.14        2.07
Blair           5.55        4.88        4.73        5.02
Diplomatic-1    3.05        2.90        2.85        2.90
Jalalabad       3.67        3.42        3.35        3.46
Jordbävning     3.14        2.97        2.92        2.97
Nelson          5.79        5.12        5.07        5.28
Avg. Swe        3.29        3.09        3.04        3.11
Avg. All        4.24        3.86        3.79        3.93
Inner Nodes     10 884      10 884      10 884      10 884
Total Nodes     54 835      54 835      54 835      54 835

Table 4.8. Performance depending on escape method (measured in bits/character). Max Context: 8, Occurrence Threshold: 8.
Compression performance as well as node count as a function of the threshold level is found in table 4.9.
Dataset         Thresh.: 32   Thresh.: 16   Thresh.: 8   Thresh.: 1
DN-4            2.64          2.42          2.14         1.17
Blair           4.73          4.74          4.73         4.91
Diplomatic-1    3.04          2.92          2.85         2.87
Jalalabad       3.42          3.39          3.35         3.35
Jordbävning     3.13          3.00          2.92         2.96
Nelson          5.19          5.14          5.07         5.17
Avg. Swe        3.20          3.11          3.04         3.06
Avg. All        3.90          3.84          3.79         3.85
Inner Nodes     2 016         4 396         10 884       188 749
Total Nodes     20 755        33 425        54 835       268 466

Table 4.9. Performance and memory requirements depending on context occurrence threshold (measured in bits/character). Max Context: 8.
4.6.5.11 Probability Scaling
A prerequisite for achieving good compression is a language reference that does not deviate too much from that which is to be compressed. If basing the source coder on a separate language reference model, such as suggested in this paper, and not on the message itself, the probability of a deviation from the model in the data to be encoded is increased. Although a significant deviation from the language reference will reduce and perhaps completely undo the effect of the source coding, the source coding will still be effective when dealing with a small deviation. A way of dealing with language deviations is to allow for a per-message scaling of the weight of the escape symbol. Given a scaling factor, d, and the non-escape weight

w(¬eC) = Σ∀ai∈A w(ai)

this would give the modified weight w′ and modified probability p′

w′(eC) = d · w(eC)
p′(eC) = (1 + (d − 1) / (1 + d · w(eC)/w(¬eC))) · p(eC)

altering the character weights and probabilities to

w′(ai) = w(ai)
p′(ai) = (1 / (1 + (d − 1) / (1 + w(¬eC)/w(eC)))) · p(ai)
The expressions for the probabilities may seem to increase complexity substantially, but the coding calculations all deal with relative weights as given by w and w′.
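The identity between the scaled weights and the rewritten probability expression can be checked numerically; a small sketch computing p′(eC) both directly and via the rewritten form (names hypothetical):

```python
def scaled_escape_prob(w_e, w_rest, d):
    """p'(e_C) after scaling the escape weight by d, computed directly
    from the scaled weights and via the rewritten expression; the two
    must agree."""
    p_e = w_e / (w_e + w_rest)                        # unscaled p(e_C)
    direct = d * w_e / (d * w_e + w_rest)             # from scaled weights
    rewritten = (1 + (d - 1) / (1 + d * w_e / w_rest)) * p_e
    assert abs(direct - rewritten) < 1e-12
    return direct
```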
Table 4.10 illustrates the effect of scaling the escape symbol counts. An approximately optimal scaling factor for each message could be determined by an exhaustive search; this is however not practical and an adaptive scaling might be better. This would be a simpler version of the more extensive adaptive escape symbol reweighting scheme presented in section 4.6.5.12.
Dataset         Scale: 0.5   Scale: 1   Scale: 2   Scale: 4
DN-4            2.00         2.14       2.37       2.73
Blair           4.98         4.73       4.59       4.60
Diplomatic-1    2.88         2.85       2.93       3.15
Jalalabad       3.42         3.35       3.38       3.57
Jordbävning     2.94         2.92       3.01       3.24
Nelson          5.23         5.07       4.98       5.02
Avg. Swe        3.08         3.04       3.11       3.32
Avg. All        3.89         3.79       3.78       3.92
Inner Nodes     10 884       10 884     10 884     10 884
Total Nodes     54 835       54 835     54 835     54 835

Table 4.10. Performance depending on escape probability scaling (measured in bits/character). Max Context: 8, Occurrence Threshold: 8.
The value of predicting characters from a deterministic context, as discussed in section 4.6.5.8, offers the possibility of scaling the weight of the single character being predicted, ai, by a factor, f, such that

w′(ai) = f · w(ai)

Table 4.11 illustrates the effect of scaling deterministic predictions.
Dataset         Scale: 1/2   Scale: 1   Scale: 1/4   Scale: 1/8
DN-4            2.13         2.14       2.13         2.12
Blair           4.74         4.73       4.75         4.77
Diplomatic-1    2.86         2.85       2.86         2.87
Jalalabad       3.35         3.35       3.36         3.37
Jordbävning     2.92         2.92       2.92         2.92
Nelson          5.10         5.07       5.07         5.07
Avg. Swe        3.04         3.04       3.05         3.05
Avg. All        3.79         3.79       3.79         3.80
Inner Nodes     10 884       10 884     10 884       10 884
Total Nodes     54 835       54 835     54 835       54 835

Table 4.11. Performance depending on escape probability scaling for deterministic contexts (measured in bits/character). Max Context: 8, Occurrence Threshold: 8.
4.6.5.12 Secondary Escape Estimation (SEE)
SEE is an adaptive model for deriving escape symbol probabilities, initially developed in 1996. While the traditional, static escape symbol probability estimation methods are derived using a priori assumptions about the source, SEE is an adaptive scheme constructing contexts of varying order and then blending those. [5]

Like the static methods described in section 4.6.5.4, SEE gathers escape symbol counts for the active context, the escape symbol count being equal to the number of different characters observed in the context. Unlike the static methods, the escape symbol count is used to construct three different contexts which in turn are used by a separate blended context predictor.

The constructed context is made up of the order, the escape symbol count, the successful match count and previous character typecodes. This is compressed using lossy quantization into a 16-bit, 15-bit and 7-bit context referred to as order 2, 1 and 0 respectively⁵. Another way of viewing this is to regard it as three different hash tables (with different hash functions) containing escape symbol frequency counts.
If en is the escape symbol frequency of context order n, then the SEE escape probability is computed as

p(escape) = ( Σn en wn ) / ( Σn wn )

where wn is the weight of context n, derived accordingly:

wn = 1 / ( en log2(1/en) + (1 − en) log2(1/(1 − en)) )
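A sketch of the SEE blend, weighting each context's escape frequency by the inverse of its binary entropy as in the formula above (assumes 0 < en < 1; names hypothetical):

```python
import math

def see_escape_prob(escape_freqs):
    """p(escape) = sum(e_n * w_n) / sum(w_n), with
    w_n = 1 / (e_n log2(1/e_n) + (1 - e_n) log2(1/(1 - e_n)))."""
    def weight(e):
        return 1 / (e * math.log2(1 / e) + (1 - e) * math.log2(1 / (1 - e)))
    ws = [weight(e) for e in escape_freqs]
    return sum(e * w for e, w in zip(escape_freqs, ws)) / sum(ws)
```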
Assuming that one can make a priori assumptions about the data produced by the source, such as, for example, that it is text from a Germanic language, static methods will offer good performance, but when coding data with unknown characteristics SEE could offer a performance boost. SEE is not implemented in the PPM implementation present in the source coding evaluation environment, primarily because of the complexity of this method.

⁵These ”contexts” are not to be confused with the text contexts.
4.6.5.13 Local Order Estimation (LOE)
LOE is a more advanced scheme to determine the order at which to make the
initial prediction for each new character.
Studies to find a reliable predictor for determining the optimal initial order
have been conducted by C. Bloom in [5], among others. A simple yet very effective
predictor was found in the form of the probability of the most common character
for each context, giving the confidence rating, C(n), for each order n according to

    C(n) = max_{a_i ∈ A} p(a_i)

and where the initial order o_init is given by the n that maximizes C(n).
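The order selection can be sketched as follows; the dictionary-based representation of the per-order statistics is an assumption made for the illustration.

```python
def initial_order(contexts):
    """Local Order Estimation: pick the order whose active context makes
    the most confident prediction, C(n) = max probability of a character."""
    def confidence(freqs):
        total = sum(freqs.values())
        return max(freqs.values()) / total if total else 0.0

    return max(contexts, key=lambda n: confidence(contexts[n]))

stats = {
    2: {"e": 8, "a": 1},  # C(2) = 8/9, a confident context
    1: {"e": 5, "a": 4},  # C(1) = 5/9
    0: {"e": 1, "a": 1},  # C(0) = 1/2
}
order = initial_order(stats)  # the order-2 context wins here
```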
4.6.5.14
Preprocessed PPM
A few schemes to preprocess text compressed by PPM exist, all of which utilize
knowledge of the grammar in the text to parse it. Several such methods are
described in [20].
Methods for PPM preprocessing include:
Capital Conversion Using upper-case/lower-case conversions, described in section 4.6.6.1.
Word Substitution Assigning symbols for words, described in section 4.6.6.3.
Punctuation Conversion Lossy versions of this are discussed in sections 4.6.7.2
and 4.6.9.2.
All PPM performance measurements in this section have been made using
Capital Conversion.
4.6.5.15
Implementing Predictive Coding in the Tiger XS
Implementing predictive coding in the Tiger XS requires that both a coder and
a predictor are implemented. We will assume that arithmetic coding is
used and that predictions are made by means of the PPM algorithm as described
above.
Implementing an arithmetic coder is relatively straightforward. Several more or
less optimized implementations exist, some of which may be subject to patent
claims. Arithmetic coding does not require any significant amount of memory.
PPM predictions require the ability to derive statistics for every possible context
(or conclude that the context has not yet been encountered). The Source
Coding Evaluation Environment does this by storing encountered statistics in a
so-called forward tree. This is not the only available option and other structures
may be more appealing.
Perhaps the most intuitive way of storing the statistics is to store the entire
language reference file. When statistics for a certain context are needed, one needs
to find all the occurrences of that context in the language reference file and record
what characters the context was followed by, and with which frequency. Performing
even a single such search would require a significant amount of resources and
several such searches would have to be carried out for every character that needed
to be encoded or decoded, making this approach extremely slow. The method
could be sped up significantly by introducing a separate order-0 frequency
table and a search structure with a set of pointers into the language reference file for
every digram. This could be sped up further by not making order-1 predictions
at all and escaping straight from order-2 to order-0.
A forward tree enables significantly faster searches. The tree is organized with
its base at the root node, representing the order-0 context. Each branch represents
an observed character in the context associated with the node. Because of this a
branch will also connect to a longer context. All branches also have a frequency
count indicating how many times the character has been observed in that context.
Example: To find the statistics for the context “wind” one navigates from the
root node using the “w”-branch, then on using the “i”-branch, followed by the
“n”-branch and “d”-branch. The branches and their frequencies of the node found
when taking the final “d”-branch give the statistics for the context “wind”.
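A minimal sketch of a forward tree in Python, using dictionaries for the branches instead of the array-of-branch-identities layout estimated below (names are illustrative; vine pointers are omitted):

```python
class Node:
    """A forward-tree node; branches map an observed character to a
    [frequency, child] pair, the child representing the longer context."""
    def __init__(self):
        self.branches = {}

def insert(root, text, max_order=4):
    # Register every context of length <= max_order occurring in text.
    for i in range(len(text)):
        node = root
        for ch in text[i:i + max_order]:
            entry = node.branches.setdefault(ch, [0, Node()])
            entry[0] += 1
            node = entry[1]

def statistics(root, context):
    """Frequencies of the characters observed directly after context."""
    node = root
    for ch in context:
        if ch not in node.branches:
            return {}  # context never encountered
        node = node.branches[ch][1]
    return {ch: entry[0] for ch, entry in node.branches.items()}

root = Node()
insert(root, "the windy winter wind")
stats_win = statistics(root, "win")  # {'d': 2, 't': 1}
```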
Forward trees could be enhanced with so-called vine pointers. Each node in
the tree would be given a vine pointer pointing to the node which represents the
context with exactly one less character in it. Example: the node corresponding to
the context “wind” would have a pointer to the node corresponding to the context
“ind”. This simplifies traversal of the tree and speeds up predictions, but at the
cost of higher memory requirements.
To estimate the memory required to store a forward tree we make the following
assumptions:
- A node is comprised of an array of branch identities.
- A branch identity is comprised of a character, a frequency count and a pointer.
- A character is stored in 1 byte of memory.
- Frequency counts are stored in 1 byte of memory (quantization may be required).
- Pointers are stored in 4 bytes of memory.
- A null pointer is used to indicate that the branch points to a leaf node.
The memory required to store a branch identity would be 1 + 1 + 4 = 6 bytes.
Since there are as many branch identities as nodes the memory required for a tree
with x nodes would then be 6x bytes.
4.6.6
Preprocessing Text
In order to achieve better compression, various methods of text preprocessing could
be employed. There are several methods for preprocessing text; some of them are
presented here.
4.6.6.1
Capital Conversion
The presence of capital letters at the start of sentences and in other situations can
easily disrupt compression performance as dictionaries fail to find corresponding
entries and predictive as well as state-based models are thrown off track. Solutions
generally deal with converting the character in question to lowercase
and using a special symbol to indicate this, as suggested in [20].
Symbols of such type could be:
UppercaseChar-symbol Denoting that the next character should be decoded
as an uppercase symbol.
UppercaseWord-symbol Denoting that all the characters in the next word
should be decoded as uppercase.
UppercaseStart-symbol Denoting that all characters following this symbol should
be converted to uppercase until an UppercaseEnd-symbol is found.
UppercaseEnd-symbol Denoting the end of a series of uppercase characters.
The effect of capital conversion on the performance files has been tested and
the result is available in table 4.12.

Coding                       Without CC   With CC   Decrease
PPM: MaxCont=8, Thresh=16    3.19         3.11      2.57%
PPM: MaxCont=8, Thresh=4     3.09         3.01      2.53%
Dict: Unigram, VL, 256       4.62         4.55      1.6%
Dict: Q-gram, FL, 4096       3.93         3.79      3.45%
Dict: Wordbook, VL, 1024     3.90         3.77      3.35%

Table 4.12. Capital conversion performance, average rate on messages in Swedish (measured in bits/char).
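A sketch of the two simplest conversions (UppercaseChar and UppercaseWord) is given below. The marker characters are hypothetical, and the scheme only handles leading capitals and all-uppercase words; it is not the implementation measured in table 4.12.

```python
UPPER_CHAR = "\x01"  # hypothetical marker: next character was uppercase
UPPER_WORD = "\x02"  # hypothetical marker: whole word was uppercase

def encode_capitals(text):
    out = []
    for word in text.split(" "):
        if len(word) > 1 and word.isupper():
            out.append(UPPER_WORD + word.lower())
        elif word[:1].isupper():
            out.append(UPPER_CHAR + word.lower())
        else:
            out.append(word)
    return " ".join(out)

def decode_capitals(text):
    out = []
    for word in text.split(" "):
        if word.startswith(UPPER_WORD):
            out.append(word[1:].upper())
        elif word.startswith(UPPER_CHAR):
            rest = word[1:]
            out.append(rest[:1].upper() + rest[1:])
        else:
            out.append(word)
    return " ".join(out)

msg = "The tiger XS sends SMS messages"
assert decode_capitals(encode_capitals(msg)) == msg
```

Words with capitals in other positions (e.g. mixed-case names) would pass through incorrectly here; a full implementation would need the UppercaseStart/UppercaseEnd symbols as well.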
4.6.6.2
Spell Check
Spelling errors are typically costly to encode as they rarely have corresponding
entries in the dictionary and as predictive as well as state-based models are thrown
off track by them. The introduction of a spell checker could mitigate this and thus
serve a two-fold function, the second function being improved readability.
To observe the effect of spelling errors on compression, spelling errors were
introduced in the reference files, approximately one per every ten words.
The results of the introduction of spelling errors in text are displayed in table 4.13.
Coding                       Normal   Sp. Errors   Increase
PPM: MaxCont=8, Thresh=16    3.11     3.21         3.23%
PPM: MaxCont=8, Thresh=8     3.05     3.16         3.82%
Dict: Unigram, VL, 256       4.62     4.62         -0.06%
Dict: Wordbook, VL, 1024     3.77     3.82         1.30%

Table 4.13. Effect of spelling errors on compression, average rate on messages in Swedish
(measured in bits/char).
4.6.6.3
Word Substitution
Assigning symbols to commonly occurring words is a method often used when
preprocessing text to be encoded with predictive methods. This, in effect, lets the
preprocessor capture the benefits of predicting the characters within a word and
allows the predictive coder to predict grammatical structures. As predicting from
long sequences of characters from a narrow alphabet may be cumbersome, the
use of symbols may lessen the work required by the predictor while increasing the
quality of the predictions. Assigning a symbol to a common six letter word will
allow predicting from a one symbol context instead of a six symbol context.
An increase of one or two percent in compression may be expected if word
substitution is used in conjunction with PPM according to [1].
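As a sketch, a substitution preprocessor could map listed words to single symbols outside the normal character range. The word list below is a toy example; a real preprocessor would derive it from the language reference file.

```python
WORDS = ["the", "and", "message", "secure"]  # toy word list
WORD_SYMBOLS = {w: chr(0xE000 + i) for i, w in enumerate(WORDS)}
SYMBOL_WORDS = {s: w for w, s in WORD_SYMBOLS.items()}

def substitute(text):
    # Replace each listed word with its one-symbol stand-in.
    return " ".join(WORD_SYMBOLS.get(w, w) for w in text.split(" "))

def expand(text):
    return " ".join(SYMBOL_WORDS.get(w, w) for w in text.split(" "))

msg = "the secure message arrived"
packed = substitute(msg)
assert expand(packed) == msg
```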
4.6.6.4
End-Of-Line Coding
Textual data is sometimes formatted in rows with a maximum width achieved by
the use of line breaks (a carriage return character followed by a line feed character
or some variation thereof). As this formatting is made independently of the content
itself and typically appears in the middle of sentences, it will inevitably hamper
source coding of the text. A simple solution may be to remove the
extra line breaks and allow the receiving equipment to format the data to fit its
screen, which has advantages beyond enhancing source coding.
An interesting approach noted in [1] is moving the EOL characters to a separate
substream coded using a variable length code (the authors use Elias codes). It
should be noted that this basic approach achieves no success on its own. For rows
of more or less fixed length one might instead use a variable length code to code
the difference in distance between the EOL characters.
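The idea of coding the differences between EOL distances can be sketched as follows (names are illustrative; the actual variable length coding of the differences is left out):

```python
def split_eol(text):
    """Remove line breaks and record the distance in characters between
    consecutive breaks as a separate substream."""
    lines = text.split("\n")
    distances = [len(line) for line in lines[:-1]]
    return "".join(lines), distances

def delta(distances):
    # For rows of near-fixed length the differences cluster around zero,
    # which suits a variable length code such as a Golomb code.
    return [b - a for a, b in zip([0] + distances, distances)]

text = "this text is\nwrapped into\nrows of near\nequal width"
stream, dists = split_eol(text)
diffs = delta(dists)  # [12, 0, 0] for the equal-width rows above
```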
4.6.6.5
Abbreviations
As people often conserve space and reduce the letters required to write common
words by using abbreviations, such abbreviations can be expected to appear in
texts being compressed.
Whilst abbreviations may conserve space in terms of number of characters, their
compressed representation may very well end up being longer than the compressed
representation of the unabbreviated term.
The deciding factor is of course whether the model of the language used includes
commonly used abbreviations.
4.6.7
Lossy Text Coding
Lossy text compression is a particularly neglected field in the art of source coding.
This is primarily due to an almost complete lack of methods for measuring distortion,
but also because a text source may not be altered without potentially serious
ramifications.
4.6.7.1
Lossy Capital Conversion
Forcing any letter preceded by one of the punctuation marks “.”, “!” and “?”,
in turn followed by a space, to be changed into an uppercase letter will increase
source coding efficiency (and perhaps improve grammar). This would also be very
effective in combination with capital conversion (see section 4.6.6.1).
4.6.7.2
Lossy Punctuation Conversion
Lossy punctuation conversion is certainly one of the least distorting methods of
lossy text compression. An excellent implementation is the GSM text compression.
The description of this method is found in section 4.6.9.2 and serves as a good
description of lossy punctuation conversion.
4.6.7.3
Synonym Coding
A very humorous and quite impractical approach to source coding text is examined
in [24], where Witten et al. propose a transform function mapping all words in
a text to the shortest of their respective synonyms using an electronic thesaurus.
The method is reported to achieve compression in the area of about 75% and a
further increase may be yielded when applying standard lossless encoding, due to
the reduced vocabulary.
4.6.8
Variable Algorithm Coding
It is important to note that a message may not contain homogeneous data; it
may very well contain data of varying sorts, for instance English text mixed with
telephone numbers and text in Italian. Though in most situations it might be
desirable to take the penalty of using a suboptimal compression method rather
than attempting complex solutions to mitigate such problems, it is fully possible to
switch compression method within the message itself.
As symbols are being encoded rather than actual characters, codespace could be
reserved for metacharacters representing directives to the predictor, dictionary or
codec. Some plausible directives are listed below:
• Directives to change the PPM predictor's statistical base (language model file),
for example changing it from a Swedish language model to an English language model.
• Directives enabling switching between different dictionaries when compressing using a dictionary approach.
• Directives turning on or off the use of non-symmetric predictors, making all
characters equally costly to encode regardless of context, for example to be
used when high entropy data such as passwords are to be encoded.
Switching between algorithms employing different codecs may at first seem
cumbersome, if at all possible. Combining a dictionary coder, which may be
assumed to use a Huffman codec, with PPM, which may be assumed to use an
arithmetic codec, could be achieved by using an underlying arithmetic bitstream and
extracting single bits for use with a Huffman codec. Bits could be extracted from
an arithmetic codec regardless of position in the stream by predicting a 0 and a 1
with a probability of 0.5 each. Each encapsulated bit would expand the
size of the bitstream generated by the arithmetic codec by exactly one bit.
In addition to switching between algorithms, any given set of predictors may
be combined using a static or a dynamic weighting function. This process would
be analogous to the context blending used in PPM (see section 4.6.5.6), though
blending algorithms using full blending would most likely be a more viable option
here than when blending multiple contexts within a single PPM.
Empirical results achieved by P. Volf [22] have shown that two fundamentally
different source coding algorithms can be combined using a so-called switching
algorithm to achieve performance better than either of the two source coding
algorithms can achieve alone. Volf provides several different switching algorithms
with varying complexity and performance.
4.6.9
GSM Text Compression
SMS messages transmitted in a GSM network may be compressed using a specific
algorithm described in the GSM standard 03.42 [12]. This compression method
has also been standardized for use within Universal Mobile Telecommunications
System (UMTS) networks, as described by 3GPP standard 23.042 [13].
The compression methods used by this standardized algorithm are primarily
adaptive Huffman coding used in combination with a small dictionary coder referred to as keywords, a blending or switching system referred to as character group
processing and a lossy form of text preprocessing called punctuation processing.
A special language indicator is included in the compression header indicating
which language model is used to set up initial dictionaries and Huffman frequency
tables.
4.6.9.1
Coding
The characters are encoded using adaptive Huffman coding. Adaptive Huffman
coding is most often initiated with a small table of characters and their relative
frequencies. Not all characters are present in the table and a special NYT symbol
(Not Yet Transmitted) is used to indicate that the next character has not previously
been observed; this character is then transmitted uncompressed.
The standard specifies several initial Huffman frequency table states for each
language used. A special flag is included in the compression header indicating
which one is to be used.
4.6.9.2
Punctuation Processing
Punctuation processing is a lossy text pre-processing scheme wherein the following
actions are taken:
• Leading and trailing spaces are removed.
• Repeated spaces within the message are replaced with a single space.
• Spaces following certain punctuation marks are removed when encoding and
reinserted when decoding.
• When encoding, the first character of the text as well as characters following
appropriate punctuation characters are decapitalized; the same characters are
capitalized when decoding.
• A full stop, if present as the last character of the message, is removed when
encoding and a full stop is inserted at the end of the message when decoding.
These actions are not reversible at the receiver and consequently the message
received is likely to differ from that transmitted, albeit hopefully not in a significant
way.
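The listed actions can be sketched with a few regular expressions. This is a loose illustration of punctuation processing, not an implementation of the GSM 03.42 algorithm; only the full stop, exclamation and question marks are treated, and the function names are illustrative.

```python
import re

def punctuation_encode(text):
    text = text.strip()                      # leading/trailing spaces
    text = re.sub(r" +", " ", text)          # collapse repeated spaces
    text = re.sub(r"([.!?]) ", r"\1", text)  # drop space after punctuation
    text = re.sub(r"(^|[.!?])([A-Z])",       # decapitalize sentence starts
                  lambda m: m.group(1) + m.group(2).lower(), text)
    if text.endswith("."):                   # drop trailing full stop
        text = text[:-1]
    return text

def punctuation_decode(text):
    text = re.sub(r"([.!?])", r"\1 ", text)  # reinsert spaces
    text = re.sub(r"(^|[.!?] )([a-z])",      # recapitalize sentence starts
                  lambda m: m.group(1) + m.group(2).upper(), text)
    return text.strip() + "."

msg = "  Hello there.  How are you?   Fine."
restored = punctuation_decode(punctuation_encode(msg))
```

As the text notes, the round trip is not exact in general: the doubled spaces in the example message are lost.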
4.6.9.3
Keywords
A small set of 128 commonly used words is used to form a dictionary that can
be used to achieve compression. The words used in the dictionary depend on the
language as set by the language indicator in the compression header. In order
to encode a word found in the dictionary, a special keyword symbol is encoded,
followed by a set of bits indicating which of the keywords was encoded and
which match options were used.
Available match options are:
Case Conversion A switch indicating whether the dictionary entry is to be
decoded as all lower-case, all upper-case or first upper-case followed by lower-case
letters.
Prefix Match A switch indicating that a special prefix sequence (set on a
per-message basis) should be prepended to the dictionary entry.
Partial Match A switch indicating that the dictionary entry was only partially
matched. The switch is followed by a numeric field in the form of a generalized
Golomb code (see section 4.7.1.1) indicating the length of the match.
Suffix Match A switch indicating that a special suffix sequence (set on a
per-message basis) should be appended to the dictionary entry.
4.6.9.4
Character Group Processing
Character group processing divides the different transmittable symbols into three
different character groups. Each character appears in one or more of the three
groups. The three different groups are roughly:
Group 0 Comprised of all lower-case letters as well as common spacing and punctuation characters.
Group 1 Comprised of all upper-case letters as well as common spacing and
punctuation characters.
Group 2 Comprised of numbers and non-letter characters such as, for example,
the parenthesis characters.
If one needs to encode a character not found in the presently selected group, a
special character group transition symbol is encoded, signaling the change of the
presently selected group.
Character group processing is essentially a blending technique and the character
group transition symbols can be viewed as escape symbols, as in context
modeling.
4.6.9.5
The SMS Compression Algorithm in Practice
An extensive search for performance data for the GSM compression algorithm has
yielded no results. Though no evidence of it is presented here, there appears to
be little or no adoption of the standard for SMS compression on behalf of the
cell phone manufacturers. A quick glance at the standard may also hint at this, as
the original 03.42 standard (released in 1999) only specifies parameters for the
English and German languages while all other languages are labeled as “under
development”, a condition that prevails in 23.042 (released in 2004). [12, 13]
One might speculate that the lack of wide adoption of the standard might
be due to a poor performance/complexity trade-off, though mobile phone operators
preferring the use of spanned messages might also play a part.
Though implementing and testing the standard would be of great interest, it
would also be quite a resource-intensive task as the standard incorporates many
different elements. Being a unigram adaptive Huffman approach, the standard
would most likely not deliver impressive performance. In fact, given the prerequisites
in section 4.6.3 it would have little or no gain compared to the significantly
simpler unigram static Huffman method presented in section 4.6.4.4. The standard
is therefore not implemented in software and consequently not evaluated.
4.6.10
Other Algorithms
A number of commonly used compression algorithms have not been addressed as
potential algorithms to be used for text compression in the Tiger XS. Below follows
a brief description of some commonly used algorithms not mentioned so far in this
chapter and an account of the reasons for them not being nominated as potential
algorithms in this application.
4.6.10.1
Burrows-Wheeler Transform (BWT)
M. Burrows and D. Wheeler introduced a special transform in 1994. It transforms
an arbitrary sequence of bit structures (usually, and from here on assumed to be,
bytes) into another sequence of such bit structures. The transform is reversible
and expands the data by a few bytes. Given that the sequence of bytes being
transformed has a grammatical structure of some sort the resulting transformed
sequence will have long series of identical bytes. [6]
The BWT does not compress the data transformed, instead it produces a
sequence of bit structures that can be easily source coded. Long sequences of
identical bytes can easily be compressed by a run-length coder or, as suggested by
the inventors [6] of BWT, by a move-to-front coder.
BWT achieves good compression. It typically outperforms Lempel-Ziv-based
techniques and achieves almost as good results as PPM-based techniques. Tables
comparing the algorithm to Lempel-Ziv-based techniques and PPM exist among
others in [6] and [7].
Several improvements on the original algorithm exist, some of them focusing
on the transform itself, many of them focusing on the coder used and others
focusing on text preprocessing.
The transform operates on a block of original data. In order to achieve good
compression a block of minimum 250 kilobytes is generally used (the inventors
used 750 kilobyte blocks for benchmarking [6]).
For this technique to be adapted for small datasets such as short text messages
the message would have to be appended to the entire language reference file and
then the transform would have to be applied to that block of data. The cost, in
terms of computing resources, of coding a small message would therefore be the
same as the cost for coding several hundred kilobytes. Since the main workload
of the algorithm is given by a sorting operation on a quadratic matrix with row
and column size equal to the block size, an operation which, given modern sorting
algorithms, is of order O(n log n), the computational complexity of the transform is
O(n log n) where n is the block size. It should be noted that the matrix could be
partially presorted to speed up the sorting when the message is appended. The
coding of a relatively small message would still come at a computational cost far
greater than of other comparable methods such as PPM. For this reason the BWT
algorithm is not considered as a viable alternative in this thesis.
4.6.10.2
Lempel-Ziv Algorithms
Lempel-Ziv-based algorithms are the most commonly used algorithms today; they
are intuitive, computationally efficient and achieve compression adequate for many
applications. The Lempel-Ziv algorithms come in two versions: LZ77, which
compresses by utilizing pointers to data already transmitted, and LZ78, which
adaptively constructs dictionaries based on data already transmitted.
Like other adaptive techniques, both LZ77 and LZ78 can be made static by
simulating the input of a language reference model and then retaining the coder
state. This coder state is then used when coding the message to be sent. But
unlike other models this approach is not likely to achieve anything better than a
dictionary derived as described in section 4.6.4.4 through 4.6.4.10.
Given the prerequisites assumed in section 4.6.3, that an adaptive approach is
unlikely to offer significant advantage when the data to be encoded is limited to
a few hundred bytes, it follows that the LZ78 approach is not significantly better
than the static LZ78-dictionary construction scheme described in section 4.6.4.7.
LZ77 is not covered by the dictionary coders presented in section 4.6.4.4 through
4.6.4.10. While LZ77 is an interesting algorithm certainly worthy of consideration,
it is unlikely that it will outperform the best of the dictionary-techniques evaluated
in this thesis or that it will outperform PPM. This while having a somewhat large
state: a buffer of the most recently seen characters, typically tens of thousands of
bytes long, not easily fitted into memory. Only the portion of the language reference
file fitted into this buffer may contribute to the coding performance. To add to the
predicament, the buffer will have to be accompanied by some kind of indexing
structure to alleviate the burden of searching through the entire buffer when encoding data.
Nevertheless LZ77 offers interesting qualities and it is primarily due to a lack
of time that this method is not evaluated in depth.
4.6.10.3
Dynamic Markov Coding (DMC)
DMC is a method to simultaneously build and use a Markov source model of a
data source. As the model is traversed, predictions are made and the number of
states is increased by cloning common states.
Bell and Moffat published a note in 1989 proving that DMC, while appearing
to be a finite-state model, lacks the power of finite-state models and is in fact a
variable-order context model. [2]
DMC performs similarly to PPM methods but with slightly different computational
efficiency. While DMC generally is faster at encoding and decoding, the very
high number of nodes required for efficient encoding causes implementations to
require very large amounts of memory. In the authors' original article the reference
implementation, which was used to measure the performance of the algorithm,
created a Markov model with over 150 000 nodes in order to compress a file of
97 393 bytes. Merely storing 32-bit pointers to 150 000 nodes would require
600 000 bytes of memory. As a Tiger XS implementation would have to make
do with little memory, and as speed is not of the essence since messages are assumed
to only consist of a few hundred characters, sacrificing memory to gain speed is not
a suitable action.
DMC has a relatively high complexity level compared with a dictionary approach
and high memory requirements compared with PPM. For these reasons the DMC
approach is not considered in this thesis.
4.7
Source Coding of Transmission Protocols
Like textual data, transmission protocols may also be source coded. Three ways to
source code transmission protocols are described in this section.
Simple Huffman codes could be used to indicate the value of a field with more
than two possible values. The more skewed the statistics for that field are, the more
gain the Huffman coding will offer.
Predictive coding from simple protocol predictors could be employed to code
protocol values. This method is especially effective if arithmetic coding is used
and the protocol statistics are very skewed.
As an example, a set of uncorrelated simple binary flags, each having a 90%
chance of being set to zero, would each have an average cost of

    H(flag) = −0.9 log2 0.9 − 0.1 log2 0.1 ≈ 0.469   [bits]

enabling the encoding of approximately 2.13 flags per bit.
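These figures can be verified numerically:

```python
import math

p = 0.9  # probability that a flag is zero
entropy = -p * math.log2(p) - (1 - p) * math.log2(1 - p)
flags_per_bit = 1 / entropy  # roughly 2.13 flags per bit
```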
Golomb coding could be used to code numeric fields in the protocol. Golomb
coding is described more extensively in the next section.
4.7.1
Coding of Numeric Fields
The most intuitive method of representing a numeric value is to assign a fixed set
of bits, giving a range large enough to accommodate all values that one wishes
to be able to represent, and allowing the combinations of bits to represent a
number in either the big endian or the little endian way (for example 0->”0000”,
1->”0001”, ..., 15->”1111”).
The method described above has two drawbacks. Often the different numbers
are not equifrequent and, as described in section 4.4.5, this entails the existence
of a more efficient representation, capable of representing the numbers using fewer
bits. Furthermore, the representation only allows a limited predetermined set of
numbers to be represented (in the example above only 16 different numbers).
At the cost of a slight increase in complexity one can use alternative ways to
represent numbers that do not suffer from these two drawbacks.
4.7.1.1
Golomb Codes
Golomb codes are a common way to efficiently represent numeric values where a
lower number is more probable than a higher number.
Golomb codes are only defined for nonnegative numbers, n, but should one need to
encode negative numbers, z, the following transform from z to n could be applied:

    n = 2z        if z ≥ 0
    n = −2z + 1   if z < 0

This assumes that numbers with lower absolute value are more probable.
Golomb codes utilize the fact that all numbers, for any given integer m > 0,
can be written on the form

    n = m·q + r

where q and r are given by

    q = ⌊n/m⌋
    r = n − q·m
The number n is represented using the two parameters q and r, encoded using
different methods: q is encoded using unary coding and r is coded using a common
big endian or little endian representation. The final codeword consists of the
codeword for q followed by the codeword for r. The resulting code (using m = 4)
is illustrated in table 4.14.
Golomb codes are optimal for the probability distribution

    P(n) = p^(n−1) (1 − p)

when

    m = ⌈−1 / log2 p⌉
Exponential Golomb codes are another type of Golomb codes often used. They
utilize the fact that all numbers, for any given integer b > 0, can be written on
the form

    n = b^a + r − 1

where a and r are given by

    a = ⌊log_b(n + 1)⌋
    r = n + 1 − b^a

The complete code is constructed by encoding a and r in the same manner as q
and r when using non-exponential Golomb codes. Exponential Golomb codes are
illustrated in table 4.15.
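Both constructions can be sketched as follows. The function names are illustrative; for simplicity m is assumed to be a power of two, so that r can be written in log2(m) bits as in table 4.14 (other values of m would call for a truncated binary code for r).

```python
def golomb(n, m):
    """Golomb codeword for n >= 0; m is assumed to be a power of two."""
    q, r = divmod(n, m)
    bits = m.bit_length() - 1          # log2(m) for a power of two
    return "0" * q + "1" + format(r, f"0{bits}b")

def exp_golomb(n, b=2):
    """Exponential Golomb codeword for n >= 0 (r is written in binary,
    so b is assumed to be 2, as in table 4.15)."""
    a = 0
    while b ** (a + 1) <= n + 1:       # a = floor(log_b(n + 1))
        a += 1
    r = n + 1 - b ** a
    return "0" * a + "1" + (format(r, f"0{a}b") if a else "")

assert golomb(5, 4) == "0101"          # matches table 4.14
assert exp_golomb(7) == "0001000"      # matches table 4.15
```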
A simple example of the applicability of Golomb codes in a messaging service is
the use of Golomb codes to represent the three data fields in the User Data Header
(UDH) pertaining to the spanning of a text message over several SMS messages. In
this situation one byte is allocated for the message identity, one byte for indicating
how many SMS messages the text message is spanned over and another byte is
used to indicate the number of that specific message. All three fields are binary
encoded and the number zero is reserved. A graph giving the total size in bits for
the UDH as a function of the number of SMS messages used for spanning, n, is
provided in figure 4.9. It is assumed that Golomb codes are used to indicate the
total number of messages, n. It is further assumed that a variable number of bits
(⌈log2 n⌉), determined by the total number of messages, are used to encode the
number of the specific message. It should be noted that the example is meant to
illustrate the capabilities of the Golomb code rather than to advocate sending data
spanned over more than 256 SMS messages, which would be impractical to say
the least.
n    q   r   q codeword   r codeword   n codeword
0    0   0   1            00           100
1    0   1   1            01           101
2    0   2   1            10           110
3    0   3   1            11           111
4    1   0   01           00           0100
5    1   1   01           01           0101
6    1   2   01           10           0110
7    1   3   01           11           0111
8    2   0   001          00           00100
9    2   1   001          01           00101
10   2   2   001          10           00110

Table 4.14. Golomb code with m = 4
n    a   r   a codeword   r codeword   n codeword
0    0   0   1            -            1
1    1   0   01           0            010
2    1   1   01           1            011
3    2   0   001          00           00100
4    2   1   001          01           00101
5    2   2   001          10           00110
6    2   3   001          11           00111
7    3   0   0001         000          0001000
8    3   1   0001         001          0001001
9    3   2   0001         010          0001010
10   3   3   0001         011          0001011

Table 4.15. Exponential Golomb code with b = 2
Figure 4.9. UDH-size as function of number of messages. Fixed-length coded messages
may not exceed 256 in number.
Chapter 5
Security
This chapter introduces security concepts, explores the security in GSM, provides
a discourse of the current security in the Tiger XS and discusses extensions of the
current security system.
5.1
Purpose of This Chapter
5.1.1
Purpose
In order to achieve the highest levels of evaluated classification a cryptosystem
has to perform encryption using so-called session keys derived from a preshared
secret and a key agreement protocol. As the current protocol is not session-based,
it needs to be amended to fulfill the requirements for transmission of secrets at
the highest classification levels.
A message not containing timestamps and being encrypted using a preshared
symmetric key and no session information is susceptible to replay attacks, a security flaw that could potentially have serious ramifications.
The aim of this chapter is primarily to discuss enhancements to the message
security, in particular relating to the introduction of so-called session keys along
with the use and setup of sessions.
This chapter also includes the introduction of a steganographic protocol powered
by PPM and a discussion regarding its efficiency.
5.1.2
Prerequisites
In this chapter it will be assumed that the channel used to transmit messages is
SMS. SMS message exchange is very far from an interactive protocol and the steps
and measures needed to establish a session must be kept at a minimum.
5.1.3
Limitations
No implementation of the techniques used for session establishment discussed here
will be provided. An implementation of the steganographic protocol discussed in
this chapter is however constructed and evaluated.
5.2 Definitions
The following terms are used extensively throughout this chapter:
Key A secret bit-vector which is used to encrypt and decrypt data.
Plaintext The data that is to be encrypted. This data does not have to consist of text and is often assumed to be digitized voice communication, digital images or other non-textual data.
Ciphertext The encrypted data, i.e. the output of the cryptosystem.
Cipher An algorithm or set of algorithms to encrypt and decrypt data.
Cryptosystem A system comprised of cipher algorithms and/or other cryptographic algorithms that achieves one or more cryptographic objectives.
Eavesdropper A person secretly intercepting communication.
5.3 Security Basics
Cryptography can be used to achieve confidentiality, to detect tampering with
messages and to assert the identity of the party with which one is communicating.
The Tiger XS uses all of these security features, which are essential components of a secure communication system. This chapter will introduce some basic cryptographic concepts, assess how they relate to the Tiger XS and use this to explore extensions to the cryptographic functions of the system.
A cryptographic system, or cryptosystem, is a set of algorithms used to achieve one or more cryptographic objectives (see section 5.3.2). The algorithms used are often standardized and publicly available; such algorithms, or the mathematical methods on which they rely, are often referred to as cryptographic primitives.
An important premise in cryptography is Kerckhoffs's principle, which states that when assessing the security of a cryptosystem one should always assume that the system and the algorithms of which it is comprised are known to all trying to break the system. In other words, the security of the system should depend only on the secrecy of the key and not on the secrecy of the system itself. Keeping a cryptosystem design secret while using it typically fails in practice; an example of this is the A5/1 cipher examined in section 5.4.
A cryptosystem often uses a cipher. Two important traits of a cipher are confusion and diffusion. Confusion entails that the cipher uses non-linear operations, which are typically more resilient to cryptanalysis. Diffusion entails that any change in the plaintext or key bits should produce an output whose bits are entirely uncorrelated with the output prior to the change.
5.3.1 Possible Attacks
An attack on a cryptosystem requires data. The type of data that is needed for
a specific attack is often used to categorize the attack itself. The following four
categories are generally used:
Ciphertext only Where an eavesdropper has access to the ciphertext only.
Known plaintext Where an eavesdropper has access to the ciphertext as well as
the plaintext. This also includes the situation where an eavesdropper has access
to only parts of the plaintext (for example, knowing that the ciphertext
begins with 'Hello').
Chosen plaintext Where an eavesdropper has temporary access to the cryptosystem itself, but no means to extract the key, and may encrypt any message preferred and gain access to the corresponding ciphertext.
Chosen ciphertext Where an eavesdropper has temporary access to the cryptosystem itself, but no means to extract the key, and may decrypt any chosen
ciphertext and gain access to the corresponding plaintext.
If a cryptosystem is to be considered secure, none of the attacks listed above should be sufficient to retrieve the key. A system which allows the key to be retrieved using one of the attacks listed above is said to be susceptible to that type of attack.
There are a number of possible ways to attack a cryptosystem, some of the
more frequently discussed are the following:
Message Deciphering A message may be deciphered revealing the contents of
the message.
Message Spoofing A message may be produced that appears to be a legitimate message sent from an authorized user.
Man-In-The-Middle (MITM) Attack In a MITM attack on the communication between user A and user B, a third party presents himself as user B to user A and vice versa, passing on messages and letting the users believe they are communicating directly. Messages could be slightly altered before being passed on to compromise public-key security systems (see section 5.3.3). Communication between parties that have no preshared secret is typically vulnerable to MITM attacks.
Replay Attack In a replay attack an authentic and legitimate message, or part of a message, is captured and resent at a later time, potentially triggering events. An example would be to capture an encrypted message saying "I'm at the gate, please open it" and resend it at a later time, fooling security into opening the gate for someone else.
Denial-of-Service (DoS) Attack In a DoS attack a malicious party attacks a
system, often by attempting to overwhelm it with false service requests, with
the intent to cause loss of service to the users of the system.
5.3.2 Cryptography Objectives
The objectives of a cryptographic system are often categorized as one or more of the following:
Confidentiality To assert that only the intended parties can gain access to the
message sent.
Data integrity To assert that the message is not in any way tampered with.
Authentication To assert that the parties communicating are those who they
claim to be.
Replay prevention To assert that the message is not a replay of a previously
sent authentic message.
Non-repudiation To assert that a user cannot deny having sent a message that the user did in fact send.
5.3.3 Symmetric and Asymmetric Ciphers
Ciphers are often divided into symmetric and asymmetric (or public-key) ciphers.
Symmetric ciphers refer to ciphers where encryption and decryption are carried out using the same key. Symmetric ciphers are generally divided into two
categories: block ciphers and stream ciphers (see section 5.3.4).
Symmetric ciphers generally have keys that are randomly picked from a set of between 2^64 and 2^256 different keys (referred to as 64-bit and 256-bit keys respectively), each key with the same probability of being picked. Sufficient security for almost all applications can generally be achieved with a 128-bit key, given that on average 2^128/2 = 2^127 ≈ 10^38 keys have to be searched before the correct key is found if a so-called exhaustive search¹ is performed.
Commonly used symmetric algorithms include Data Encryption Standard (DES),
Advanced Encryption Standard (AES) and Rivest Cipher 4 (RC4).
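The arithmetic behind the exhaustive-search estimate above can be sketched numerically; the assumed search rate of 10^12 keys per second is a hypothetical figure chosen purely for illustration.

```python
# Rough cost of an exhaustive key search for different key sizes.
# The keys-per-second figure below is an assumption, not a measured value.

def average_trials(key_bits: int) -> int:
    """On average half the key space must be searched, i.e. 2^(bits-1) trials."""
    return 2 ** (key_bits - 1)

KEYS_PER_SECOND = 10 ** 12            # assumed attacker speed
SECONDS_PER_YEAR = 60 * 60 * 24 * 365

for bits in (64, 128, 256):
    trials = average_trials(bits)
    years = trials / (KEYS_PER_SECOND * SECONDS_PER_YEAR)
    print(f"{bits}-bit key: 2^{bits - 1} trials on average, ~{years:.3e} years")
```

Even under this generous assumption, a 128-bit key space remains far out of reach, which is why 128 bits is described above as sufficient for almost all applications.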
Asymmetric (or public-key) ciphers have separate keys for encryption and decryption. The key used for encryption is generally referred to as the public key and the key used for decryption as the private key. The use of symmetric ciphers requires the communicating parties to possess a shared secret (the key). Using asymmetric ciphers it is possible to communicate securely with parties with whom one does not share a secret.
Asymmetric algorithms are also used for purposes other than confidentiality, such as authentication (via digital signatures) and non-repudiation.
Asymmetric algorithms tend to be much slower in terms of throughput than symmetric algorithms, often by several orders of magnitude. Because of this difference in performance, purely asymmetric systems are rarely used; instead so-called hybrid systems are employed. A hybrid system uses a symmetric algorithm to carry out the encryption of the actual data; the symmetric key is then encrypted using an asymmetric algorithm and prepended to the message. Algorithms that use the properties of asymmetric ciphers yet only serve to establish a shared secret are generally referred to as key-exchange algorithms.
¹ A method wherein keys are tested sequentially until the correct key is found.
Key lengths for asymmetric ciphers are generally significantly higher than for symmetric ciphers. The higher key length is due to the fact that not all keys are equiprobable: searching for keys can be carried out in structured ways, as the algorithms force the keys to have certain numerical properties. Key lengths are generally between 512 and 4096 bits for integer factorization-based algorithms and between 128 and 512 bits for elliptic curve-based algorithms.
Commonly used asymmetric algorithms include Rivest-Shamir-Adleman (RSA), Elliptic Curve Cryptography (ECC) and Diffie-Hellman (DH).
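As a sketch of how a key-exchange algorithm such as Diffie-Hellman establishes a shared secret, the following toy exchange uses a deliberately tiny prime; the parameters P and G are illustrative assumptions, far below the real key sizes quoted above.

```python
# Toy Diffie-Hellman key exchange. The 32-bit prime is for illustration
# only; real deployments use moduli of 1536 bits or more.
import secrets

P = 0xFFFFFFFB  # largest prime below 2^32 -- far too small to be secure
G = 5           # generator, assumed suitable for this toy prime

# Each party picks a secret exponent and publishes g^x mod p.
a = secrets.randbelow(P - 2) + 1
b = secrets.randbelow(P - 2) + 1
A = pow(G, a, P)   # sent from user A to user B over the open channel
B = pow(G, b, P)   # sent from user B to user A over the open channel

# Both sides derive the same shared secret without ever transmitting it.
shared_a = pow(B, a, P)
shared_b = pow(A, b, P)
assert shared_a == shared_b
```

An eavesdropper sees only P, G, A and B; recovering the secret requires solving a discrete logarithm, which is what makes the real, large-parameter version of the exchange hard to break.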
5.3.4 Block Ciphers and Stream Ciphers
Symmetric ciphers are often divided into one of two categories: block ciphers and
stream ciphers.
Block ciphers take a key and a block of plaintext, often between 64 and 256
bits in length, and transform that to an equally long block. The operation is
reversible if and only if one has access to the key used when encrypting. Block ciphers can be operated in different modes, and different modes have different properties in terms of error propagation, security, parallelization, etc. Block cipher operation is depicted in figure 5.1.
Figure 5.1. Block cipher operation
Stream ciphers use a key to set up a secret internal state, which is used to
drive a state machine and output a sequence of binary data referred to as the key
stream. The key stream is then XORed with the plaintext to derive the ciphertext.
Stream cipher operation is illustrated in figure 5.2.
Figure 5.2. Stream cipher operation
Block ciphers are generally considered more secure than stream ciphers, but stream ciphers typically offer better performance and simpler hardware implementation. Stream ciphers are therefore commonly used in communication systems that employ mobile units and real-time services, while systems that are immobile or have strict security requirements often use block ciphers.
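The keystream construction in figure 5.2 can be sketched with a toy cipher; the SHA-256-in-counter-mode generator below is an assumption made for illustration, not a cipher used by the Tiger XS or any real system.

```python
# Toy stream cipher in the style of figure 5.2: a key-driven keystream
# generator XORed with the plaintext. Illustration only -- not a vetted cipher.
import hashlib

def keystream(key: bytes, length: int) -> bytes:
    """Generate `length` keystream bytes by hashing key || counter."""
    out, counter = b"", 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]

def stream_crypt(key: bytes, data: bytes) -> bytes:
    """XOR data with the keystream; applying it twice restores the data."""
    return bytes(d ^ k for d, k in zip(data, keystream(key, len(data))))

ct = stream_crypt(b"secret key", b"attack at dawn")
pt = stream_crypt(b"secret key", ct)   # XOR again with the same keystream
assert pt == b"attack at dawn"
```

Because encryption is a plain XOR with the keystream, encryption and decryption are the same operation, which is part of why stream ciphers lend themselves to simple hardware implementations.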
5.3.5 Initialization Vectors
Encrypting using nothing but a static key can cause the cryptosystem to leak information. An obvious example of information leakage is when the same plaintext is encrypted at two different times, yielding the same ciphertext both times. An eavesdropper knowing the contents of the first message would instantly know the contents of the second message, and even if one does not know the plaintext of either message, the fact that they are known to be identical provides information to the eavesdropper.
The use of an Initialisation Vector (IV) mitigates this. An IV is a bit-vector
used with both stream ciphers and block ciphers. The IV is used to alter the internal state of a stream cipher or the starting state of a block cipher, causing the same plaintext encrypted under the same key to yield completely different ciphertexts.
The IV is generally considered to be public and is either appended to the
message or constructed from other data known to both the sender and the receiver
such as the communication protocol fields of the message. An IV should be constructed so that it rarely, or preferably never, repeats; one that is constructed to never repeat is often referred to as a Number used ONCE (nonce).
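A minimal sketch of the effect of an IV, under the assumption of a hash-based toy keystream (not a real cipher): mixing a fresh IV into the keystream makes identical plaintexts encrypt differently under the same key.

```python
# Toy demonstration of IV use: keystream derived from key || IV || counter.
# The SHA-256 construction is an illustrative assumption, not a real cipher.
import hashlib
import secrets

def keystream(key: bytes, iv: bytes, length: int) -> bytes:
    """Derive keystream bytes from key || IV || counter."""
    out, ctr = b"", 0
    while len(out) < length:
        out += hashlib.sha256(key + iv + ctr.to_bytes(8, "big")).digest()
        ctr += 1
    return out[:length]

def encrypt(key: bytes, iv: bytes, data: bytes) -> bytes:
    """XOR the data with the keystream; the same call also decrypts."""
    return bytes(d ^ k for d, k in zip(data, keystream(key, iv, len(data))))

key, msg = b"shared key", b"same plaintext"
c1 = encrypt(key, secrets.token_bytes(16), msg)  # fresh random IV
c2 = encrypt(key, secrets.token_bytes(16), msg)  # another fresh IV
assert c1 != c2  # identical plaintexts no longer yield identical ciphertexts
```

The IVs themselves can be sent in the clear alongside each ciphertext, as discussed above; only the key must remain secret.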
5.3.6 Cryptographic Hash Algorithms and MACs
A hash algorithm maps a dataset of arbitrary length onto a fixed-length bit-vector. A cryptographic hash algorithm does so using an irreversible function which is also assumed not to leak any information about the original dataset.
The purpose of a cryptographic hash algorithm is to protect messages against
tampering, i.e. inserting, changing or removing data in the message. Normal use
involves computing the hash value (often called message digest, digital signature
or simply hash) and appending that to the message. When receiving the message
the recipient computes the hash value of the message and verifies that it is the
same as the hash value attached to the message.
An obvious flaw in the scheme described above is the possibility of an attacker changing the message, recomputing the hash value and replacing the original hash value with the new one, making the message once again appear valid. Two ways to protect against this exist. The first is simply to encrypt the message and the hash value, or just the hash value. Given that the cipher is not broken, the hash value could be assumed to protect the message from being tampered with. The second method is to use a so-called Message Authentication Code (MAC), typically using the keyed-Hash Message Authentication Code (HMAC) algorithm. Constructing a MAC using HMAC involves computing a hash value for what is roughly a message
with a secret key prepended to the message. A MAC constructed in this manner can only be verified and constructed by those parties with knowledge of the key.
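Using Python's standard hmac module, the MAC computation and verification described above can be sketched as follows; the example key and messages are of course hypothetical.

```python
# Sketch of MAC use with the standard-library HMAC implementation.
import hashlib
import hmac

key = b"shared secret key"
message = b"I'm at the gate, please open it"

# Sender computes the MAC and appends it to the message.
tag = hmac.new(key, message, hashlib.sha256).digest()

# Receiver recomputes the MAC and compares in constant time.
expected = hmac.new(key, message, hashlib.sha256).digest()
assert hmac.compare_digest(tag, expected)

# Tampering with the message invalidates the MAC, since an attacker
# without the key cannot compute a matching tag.
forged_tag = hmac.new(key, b"tampered message", hashlib.sha256).digest()
assert not hmac.compare_digest(tag, forged_tag)
```

Unlike a bare hash appended to the message, the tag cannot be recomputed by an attacker, because computing it requires the secret key.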
An important situation occurring with cryptographic hash functions is a so-called hash collision. A hash collision is said to occur when two messages, m1 and m2, have the same hash values, h1 = hash(m1) and h2 = hash(m2). It is important to note that hash collisions are a necessity. Consider for example all messages of 150 bytes (1200 bits) and their respective 160-bit hash values: there are clearly more possible messages than hash values (2^1200 messages, 2^160 hash values), and consequently several messages must share the same hash value. In fact one can expect about 2^1200/2^160 = 2^1040 messages to share any given hash value.
Hash collisions may be unavoidable, but for a good cryptographic hash function they should nevertheless be computationally infeasible to find. The following
three traits are important for a cryptographic hash function:
Preimage resistant For any given hash value, h it should be computationally
infeasible to find m such that h = hash(m).
Second preimage resistant For any given message, m1 it should be computationally infeasible to find m2 such that hash(m1 ) = hash(m2 ).
Collision resistant It should be computationally infeasible to find two messages,
m1 and m2 , with the same hash value, h = hash(m1 ) = hash(m2 ).
Furthermore, for a given message, m, the hash value, h = hash(m), should be able
to be computed as quickly as possible.
Commonly used hash algorithms include Message-Digest algorithm 5 (MD5), which is now considered completely broken, and Secure Hash Algorithm 1 (SHA-1), which is known to have flaws, though not as critical as those of MD5.
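Both the counting argument and the expected lack of output correlation can be checked numerically, here with SHA-256 as the example hash:

```python
# Numerical checks of the properties discussed above.
import hashlib

# Counting argument: 1200-bit messages vs 160-bit hash values.
assert 2 ** 1200 // 2 ** 160 == 2 ** 1040

def hamming(a: bytes, b: bytes) -> int:
    """Number of differing bits between two equal-length byte strings."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

# Even a small change in the input should give an uncorrelated digest:
# on average about half of the 256 output bits differ.
d1 = hashlib.sha256(b"message 1").digest()
d2 = hashlib.sha256(b"message 2").digest()
print(hamming(d1, d2), "of 256 output bits differ")
```

This bit-flipping behaviour is the hash-function analogue of the diffusion property described for ciphers in section 5.3.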
5.3.7 Challenge-Response Schemes
Challenge-Response schemes are used to authenticate users. The basic assumption
is that only authorized users have access to a secret key. This key could be the
private key of an asymmetric cryptosystem or a symmetric key shared by a group
of people. In order to authenticate the user one needs to assert that the user has
access to the key.
The simplest method of asserting that the user has access to the key is, of course, to simply ask the user to present the key. This scheme is flawed, as anyone listening in, as well as the party performing the authentication, gains access to the key.
Another method would be to present a mathematical problem which is to be solved using the key and a specific random number referred to as the challenge; only a user with the correct key would have the correct response to that problem. The problem could be constructed in a way that would make it overwhelmingly difficult to derive the key from the challenge and the response; this way an eavesdropper would not gain access to the key, yet the authentication could be carried out.
The scheme might look like the following, where f is a function with a cryptographic hash-like structure, i.e. its inverse is difficult to determine. Furthermore, let c be the challenge and k be the key; the response, r, would then be
r = f (c ⊕ k)
Assuming that knowing f (x) yields no substantial information about x, an eavesdropper could not gain access to the key, yet everybody with access to the key could verify other users by determining whether the r they supplied is correct. This scheme assumes that the same c is never used twice, as reuse would compromise the system.
Similar systems exist for asymmetric schemes using the private key. This
enables the authentication to take place even though only the authenticating user
has access to the private key.
5.4 GSM Security
GSM offers basic over-the-air security, hampering third-party attempts to misuse
the network.
5.4.1 GSM Security in Brief
GSM security basically offers the following three security features:
User Authentication The user is authenticated by the base station.
Data Confidentiality The over-the-air transmissions are ciphered to prevent
eavesdropping.
User Identity Confidentiality The user authentication process is constructed
not to reveal the identity of the user. This is aimed at reducing the possibility
for a third party to monitor the movements of a user.
These features are realized using the following components:
IMSI A 15 digit number uniquely identifying a user.
Ki A secret 128-bit preshared key.
A3 A challenge-response authentication algorithm taking a 128-bit vector (called RAND) and outputting a 32-bit vector (called SRES).
A5 A set of cipher algorithms. A5/1 - the original “strong” cipher. A5/2 - a
deliberately weakened algorithm. A5/3 - a newer, improved algorithm.
A8 A session key generation algorithm.
SIM A smart-card loaded with the preshared secrets and required algorithms.
The security mandates that the Ki cannot be read by the phone. The SIM-card
therefore contains a micro-processor to run the A3 and A8 algorithms using the
Ki and IMSI. The SIM-card will not cooperate unless the user supplies the correct
PIN-code.
The A8 algorithm is run on the SIM-card to generate the session key. The
session key is then used by the phone along with the frame number to set up the
state of the A5 cipher.
The A5/1 and A5/2 algorithms are stream ciphers based on several LFSRs
with irregular clocking determined by non-linear functions. The A5/3 is a block
cipher based on Feistel networks.
Among most operators, the functions of A3 and A8 are realized by a single algorithm called COMP-128 (three versions exist: v1, v2 and v3, the latter being uncommon).
5.4.2 Vulnerabilities
GSM security has many publicly known flaws, some of which are very serious. They are presented below.
GSM does not provide a mechanism for authenticating the network to the user.
This allows anyone to set up a fake base station, gathering information about users
and/or attacking the A3 algorithm using a chosen-plaintext attack, among other
things.
The GSM protocol allows the base station to issue a so-called IDENTITY REQUEST, requesting the identity of the user. In particular, a request for the IMSI number may be issued. This will be automatically answered by the user's phone, as mandated by the protocol, revealing the identity of the user and coupling the publicly used temporary identity (TMSI) with the non-public IMSI. An attacker may achieve this by imitating a legitimate base station.
The COMP-128v1 algorithm commonly used for authentication and session
key generation has serious flaws. Using the full entropy of the randomly chosen
Ki, an attack on COMP-128v1 would have had a complexity of as much as 2^128, had there been no attack more efficient than an exhaustive search. Attacks on the COMP-128v1 algorithm with complexity as low as about 2^14 have been
discovered. Such an attack recovers the Ki from the target and therefore entails a
complete compromise of the security.
Combining the lack of base station authentication with the flaws in COMP-128v1 enables an attacker to recover the Ki in an over-the-air attack requiring no physical access to the phone and its SIM-card.
Unlike many other transmission systems, GSM applies channel coding before encryption, giving a plaintext with a high degree of redundancy. A 1/2-rate convolutional code, which may or may not be punctured, is commonly used, enabling a partially known-plaintext attack. Indeed, some of the published attacks on the A5 ciphers utilize this flaw to achieve known-plaintext attacks under what are in practice ciphertext-only scenarios. [17]
The A5/2 cipher, which was intentionally made weak, can easily be attacked using a ciphertext-only attack on a standard PC in under a second [11]. The stronger
A5/1 proved somewhat more complicated to defeat, but known-plaintext attacks were published as early as 2001 [4] and ciphertext-only attacks were presented in 2003 [11]. Attacks on A5/1 demand a substantial amount of preprocessing and storage of the resulting data; the attacks enable time/memory/data tradeoffs, but several dozen terabytes of storage space may be required. The weakness of the A5/2 cipher may be exploited by an attacker who poses as a base station and, without knowledge of Ki, directs the user to use A5/2 instead of A5/1, thereby simplifying cryptanalysis.
The security in GSM leaves much to be desired. Several measures have been taken to mitigate the problems in the system: improved versions of COMP-128 (COMP-128v3) and A5 (A5/3) have been developed and are mandatory in 3G systems. Channel coding is applied after encryption when using General Packet Radio Service (GPRS), and UMTS employs authentication of the network. [17]
It is, however, important to remember that no widely available wireless telephony service, GSM or other, provides endpoint-to-endpoint encryption and authentication. This security is offered only by the Tiger XS and other similar products.
5.5 EMSG
The EMSG protocol enables transmission of messages with payloads up to 65478 bytes in length; in practice this is limited by the transmission channel, which in the case of SMS limits the payload length to 82 bytes. The protocol provides confidentiality using an interchangeable cipher, integrity verification and implicit authentication (though only authenticating the sender). The channel is assumed to be one-way only, and a transmission comprises a single message without acknowledgment. The message transmission is depicted in figure 5.3.
Figure 5.3. EMSG transmission
5.6 Extending EMSG
In order to achieve the highest levels of evaluated classification a cryptosystem
has to perform encryption using so-called session keys derived from a preshared
secret and a key agreement protocol. As the EMSG protocol is not session based
and as it only utilizes a preshared key it does not fulfill the requirements for
transmission of secrets at the highest classification levels.
A message not containing timestamps and encrypted using a preshared symmetric key with no session information is susceptible to replay attacks, a security flaw that could have serious ramifications.
In order to facilitate transmission of messages at higher security classification levels, the EMSG protocol has to be adapted to accommodate the requirements of such classifications. This section aims to explore three methods for deriving and utilizing session keys: one assisted by a Key Distribution Center (KDC), one that is peer-to-peer oriented, aimed at direct, unassisted communication between mobile units, and one utilizing the speech channel.
The objectives of the extended EMSG protocols shall be the following:
Confidentiality. The message payload shall be encrypted.
Authenticity. It shall be verified that the sender of the message is an authorized
user.
Ephemeral keys. The ciphering shall be carried out using keys derived from a shared secret as well as an agreed-upon session key.
Payload Integrity. The payload shall be protected from tampering.
Although keys unique to a specific session are used to transfer messages over an insecure channel, these keys cannot be discarded at the end of the session if the messages are saved in the device memory as received. Saving messages for viewing at times other than the exact time of reception requires one of the following methods to be used:
- The message can be saved encrypted, the encryption being that which was
used during transmission. This method requires the session key to be stored
along with the message until the message is deleted, the session key may
very well be stored encrypted.
- The message can be re-encrypted using another key and stored, which allows
the session key to be destroyed.
5.6.1 Centrally Assigned Session Keys
To explore a centrally assigned session key protocol a single protocol prototype
is presented which regulates the structure of the traffic between the sending user
(denoted user A), the receiving user (denoted user B) and the Key Distribution
Center (KDC). The protocol is based on the widely used Kerberos protocol.
The protocol assumes the existence of a set of secret keys, referred to as user
keys, known only by the KDC and each one of the users respectively. A user key
is denoted KKX , where X is replaced by the letter associated with the user (for
example KKA for user A).
The communication consists of a threefold message exchange, of which two messages are sent between the sending party (denoted user A) and the KDC and one message between the sending user and the receiving user (denoted user B). The first message is sent by user A to the KDC in the form of an EMSG-KREQ message requesting a key to be used for communication between users A and B. The EMSG-KREQ is then followed by an EMSG-KRES message sent by the KDC to user A, issuing a key for communication between users A and B. Finally, a message containing the actual payload is sent from user A to user B. This process is depicted in figure 5.4.
Figure 5.4. Kerberos Extended EMSG transmission
5.6.1.1 EMSG-KREQ
The EMSG-KREQ is sent by the sending user, user A, to the KDC, requesting a session key to be used when communicating with another user, user B. The message is sent unencrypted and contains the identities of users A and B. Identities are assumed to be limited to 32 bytes; the length of the identities is adapted to accommodate phone numbers as identities.
Protocol fields present in this message are:
EMSG-Type The type of EMSG, shall be set to 0x04 for EMSG-KREQ.
User A Identity Len The length in bytes of the field identifying user A.
User A Identity The data used to identify user A.
User B Identity Len The length in bytes of the field identifying user B.
User B Identity The data used to identify user B.
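As an illustration, an EMSG-KREQ could be packed as follows; the helper is a hypothetical sketch based on the field list above, not Sectra's implementation, and the phone-number identities are made up.

```python
# Hypothetical packing of an EMSG-KREQ: type byte, then each identity
# prefixed with its one-byte length (see the field list above).
import struct

EMSG_KREQ = 0x04  # EMSG-Type value for EMSG-KREQ, per the text

def pack_kreq(id_a: bytes, id_b: bytes) -> bytes:
    """Serialize an EMSG-KREQ; identities are limited to 32 bytes."""
    assert len(id_a) <= 32 and len(id_b) <= 32
    return (struct.pack("B", EMSG_KREQ)
            + struct.pack("B", len(id_a)) + id_a
            + struct.pack("B", len(id_b)) + id_b)

msg = pack_kreq(b"+46701234567", b"+46707654321")
assert msg[0] == EMSG_KREQ
assert len(msg) == 3 + 12 + 12   # total length 3 + LA + LB
```

The total length matches the 3 + LA + LB formula given for the message, so any pair of 32-byte identities fits comfortably within a single SMS payload.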
5.6.1.2 EMSG-KRES
The EMSG-KRES is sent by the KDC in response to a key request in the form of an EMSG-KREQ message. The EMSG-KRES is sent if and only if user A and user B are authorized to communicate securely as configured in the KDC. A security
Field                   Field length
EMSG Type               1
User A Identity Len     1
User A Identity         LA
User B Identity Len     1
User B Identity         LB
Total Length            3 + LA + LB

Table 5.1. EMSG-KREQ message format (field length in bytes)
level indicator is also determined and appended to the message, indicating at which security level the users may communicate.
The EMSG-KRES contains the session key, the temporal properties of the session key, the communication security level indicator and the user IDs. The protocol is adapted to accommodate user IDs with a maximum length of 32 bytes and 192-bit symmetric session keys, a restriction prompted by the limitations of the SMS channel.
The contents of the message are displayed in table 5.2.
Protocol fields present in this message are (keys in parentheses signal which keys have been used to encrypt the field, and in what order):
EMSG-Type The type of EMSG, shall be set to 0x05 for EMSG-KRES.
Key Timestamp (KKA ) A timestamp indicating the start of the key validity
period.
Key Validity (KKA ) A length of time indicating the length of the key validity
period.
Session Key (KKA ) The session key used for communication between user A
and B, denoted KAB .
User B ID Len (KKA ) The length in bytes of the field identifying user B.
User B ID (KKA ) The data used to identify user B.
Key Timestamp (KKB , KKA ) A timestamp indicating the start of the key validity period.
Key Validity (KKB , KKA ) A length of time indicating the length of the key
validity period.
Session Key (KKB , KKA ) The session key used for communication between user
A and B, denoted KAB .
User A ID Len (KKB , KKA ) The length in bytes of the field identifying user
A.
User A ID (KKB , KKA ) The data used to identify user A.
Field                       Field length
EMSG Type                   1
Key Timestamp (KKA)         4
Key Validity (KKA)          2
Session Key (KKA)           24
User B ID Len (KKA)         1
User B ID (KKA)             LB
Key Timestamp (KKB, KKA)    4
Key Validity (KKB, KKA)     2
Session Key (KKB, KKA)      24
User A ID Len (KKB, KKA)    1
User A ID (KKB, KKA)        LA
Total Length                63 + LA + LB

Table 5.2. EMSG-KRES message format (field length in bytes)
5.6.1.3 EMSG-KDAT
The EMSG-KDAT carries the payload as well as the session key and its attributes. The session key and its attributes remain encrypted using KKB, known only to the receiver and the KDC. The contents of the message are displayed in table 5.3.
Protocol fields present in this message are:
EMSG-Type The type of EMSG, shall be set to 0x06 for EMSG-KDAT.
Key Timestamp (KKB ) A timestamp indicating the start of the key validity
period.
Key Validity (KKB ) A length of time indicating the length of the key validity
period.
Session Key (KKB ) The session key used for communication between user A
and B, denoted KAB .
User A ID Len (KKB ) The length in bytes of the field identifying user A.
User A ID (KKB ) The data used to identify user A.
Payload Length (KAB ) The length of the payload.
Payload (KAB ) The payload.
MAC (KAB ) A Message Authentication Code used to verify the integrity of the
payload. The MAC is computed on the payload length and the payload
itself.
Field             Field length    Key(s) used to encrypt
EMSG Type         1               none
Key Timestamp     4               KKB
Key Validity      2               KKB
Session Key       24              KKB
User A ID Len     1               KKB
User A ID         LA              KKB
Payload Length    2               KAB
Payload           LP              KAB
MAC               32              KAB
Total Length      66 + LA + LP    -

Table 5.3. EMSG-KDAT message format (field length in bytes)
5.6.1.4 EMSG-FAIL
The EMSG-FAIL message is sent to signal an error. This message is optional, and omitting it might reduce the impact of DoS attacks. The contents of the message are displayed in table 5.4.
Protocol fields present in this message are:
EMSG-Type The type of EMSG, shall be set to 0x00 for EMSG-FAIL.
Session ID The identity of the failed session, shall be set to 0x0000 if no session
identity is available.
Error Code A code identifying the error that occurred.
Error Message Length The length of the optional error message.
Error Message An optional error message.
Field                 Field Length
EMSG-Type             1
Session ID            4
Error Code            2
Error Message Len     1
Error Message         LE
Total Length          8 + LE

Table 5.4. EMSG-FAIL message format (field length in bytes)
5.6.1.5 Channel Issues
Both the EMSG-KREQ and the EMSG-KRES messages are adapted to fit within a single SMS message, given that the identity length is limited to 32 bytes. The EMSG-KDAT message could be fitted into a single message given that LA + LP ≤ 72 bytes; as it is probable that this is not the case, the EMSG-KDAT message will likely have to be transmitted using two SMS messages.
Extending the maximum length of the session key beyond 192 bits would require the EMSG-KRES message to be transmitted as two SMS messages; it would also extend the size of the EMSG-KDAT message.
An advantage, compared to the peer-to-peer approach described in section 5.6.2,
is that the receiving device does not have to be accessible for a key-negotiation.
This enables the full process of sending a message to be concluded even though
the receiving device is offline.
Its disadvantage is that the protocol relies on the KDC being reachable to
function. If the KDC cannot be reached no session keys can be issued. Therefore
it is desirable to provide a fallback mechanism, possibly using this protocol as
primary method and using the peer-to-peer approach described in section 5.6.2 as
a fallback method.
5.6.1.6 Security
The protocol relies on the secrecy of the user keys (KKA and KKB in the previous
sections). Given that these are kept secret it would be difficult to derive the
session key, KAB . The secrecy of the session key assures the confidentiality of the
communication.
Possession of the session key encrypted using KKB, as well as the session key
in plaintext, implies that user A has been authorized by the KDC to send messages
to user B. The validity of these two keys can be verified using the MAC.
This protocol is susceptible to replay attacks within the validity period of the
session key. The window for executing a replay attack could be made smaller
by reducing the key validity period. A longer validity period would mean that
several messages could be sent using the same session key, reducing the amount of
communication with the KDC needed.
Security from man-in-the-middle attacks is provided as preshared secrets are
used to encrypt session traffic.
The protocol relies only on symmetric algorithms, an advantage as asymmetric
algorithms tend to be less trusted.
A disadvantage is that this method requires the users' device clocks to be
synchronized. To some extent this is already a necessity, as the system employs
preshared secrets with limited validity periods, though timestamps may require a
higher degree of accuracy.
5.6.2 Peer-to-Peer (P2P) Session Establishment
To explore a peer-to-peer session establishment protocol, four protocol prototypes
are presented. These four protocols do not differ in principle but rather in
algorithms and key sizes. All four protocols assume the existence of one or more
shared secret keys as well as one or more sets of Diffie-Hellman parameters, both
addressed by eight-byte identities. The objectives of this protocol are the same as
for the Kerberos-based extended EMSG protocol and are described in section 5.6.
The following are the four protocols:
DH-1 Utilizing a Diffie-Hellman based session key establishment with a key size
of 1536 bits.
DH-2 A slimmer version of DH-1, also utilizing Diffie-Hellman based session key
establishment but with a key size of 736 bits. This protocol is adapted to
fit in single SMS messages.
ECDH-1 Utilizing an Elliptic Curve Diffie-Hellman (ECDH) based session key
establishment with a key size of 521 bits.
ECDH-2 A slimmer version of ECDH-1, also utilizing ECDH based session key
establishment but with a key size of 384 bits.
The P2P Extended EMSG protocol transmits messages via a three-message
exchange referred to as a session. The process consists of the transmission of an
EMSG-SessionRequest (EMSG-SREQ) message sent by the initiating user (denoted
user A), followed by an EMSG-SessionResponse (EMSG-SRES) message
sent by the receiving user (denoted user B), finally followed by the message containing
the actual data, the EMSG-SessionData (EMSG-SDAT) message sent by
user A. This process is depicted in figure 5.5.
In addition to the messages specified below, an EMSG-FAIL message as specified
in section 5.6.1.4 could be sent in case of a failure.
Figure 5.5. P2P Extended EMSG transmission
5.6.2.1 EMSG-SREQ
The EMSG-SREQ message serves as a session initiation request and consists of
a public DH part and key identities identifying the specific preshared key and
Diffie-Hellman parameters to be used in this session. The contents of the message
are displayed in table 5.5.
Protocol fields present in this message are:
EMSG-Type The type of EMSG, shall be set to 0x01 for EMSG-SREQ.
Session ID An identity assigned to this session by the sending user (user A).
Symmetric key ID Identifying the symmetric key or shared secret to be used
in this session.
DH key ID Identifying the Diffie-Hellman parameters to be used with the public
keys in this session.
Public DH part Providing the sender's public Diffie-Hellman part.
Field              DH-1   DH-2   ECDH-1   ECDH-2
EMSG-Type          1      1      1        1
Session ID         4      4      4        4
Symmetric key ID   8      8      8        8
DH key ID          8      8      8        8
Public DH part     192    96     66       49
Total Length       212    116    86       69
Table 5.5. EMSG-SREQ message format (field length in bytes)
5.6.2.2 EMSG-SRES
The EMSG-SRES message constitutes the response to the session initiation
request and serves to submit the recipient's session parameters. The contents of
the message are displayed in table 5.6.
Protocol fields present in this message are:
EMSG-Type The type of EMSG, shall be set to 0x02 for EMSG-SRES.
Session ID The identity assigned to this session by the initiating user (user A).
Challenge Providing a cryptographic challenge to the sender (see section 5.3.7).
Public DH part Providing the sender's public Diffie-Hellman part.
MAC A Message Authentication Code used to verify the integrity of the session.
The MAC is computed over all data sent by user B up to this point. As the
session key is not yet established, a preshared key would have to be used.
This would preferably be a separate key used only for signing, in case the
MAC would otherwise leak key information.
Field            DH-1   DH-2   ECDH-1   ECDH-2
EMSG-Type        1      1      1        1
Session ID       4      4      4        4
Public DH part   192    96     66       49
Challenge        4      4      4        4
MAC              32     20     32       20
Total Length     232    124    106      77
Table 5.6. EMSG-SRES message format (field length in bytes)
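The transcript MAC described above can be sketched as follows. HMAC-SHA-256 (matching the 32-byte MAC field of the DH-1/ECDH-1 profiles) is an assumed choice; the thesis does not fix the MAC algorithm, and the key and transcript values below are purely illustrative.

```python
import hmac
import hashlib

def transcript_mac(signing_key: bytes, transcript: bytes) -> bytes:
    """MAC over all data one party has sent in the session so far,
    keyed with a preshared key reserved for signing."""
    return hmac.new(signing_key, transcript, hashlib.sha256).digest()

# User B computes the MAC over everything it has sent; user A recomputes
# it over the received bytes and compares in constant time.
key = b"\x01" * 32                                    # hypothetical signing key
sent_by_b = b"emsg-type|session-id|dh-part|challenge" # illustrative transcript
tag = transcript_mac(key, sent_by_b)
assert hmac.compare_digest(tag, transcript_mac(key, sent_by_b))
assert len(tag) == 32
```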
5.6.2.3 EMSG-SDAT
The EMSG-SDAT message serves as the final stage of the session establishment as
well as the payload transport message. The contents of the message are displayed
in table 5.7.
Protocol fields present in this message are:
EMSG-Type The type of EMSG, shall be set to 0x03 for EMSG-SDAT.
Session ID The identity assigned to this session by the sending user (user A).
Challenge Response Providing the response to the challenge.
Payload Length The length of the payload.
Payload The actual message.
MAC A Message Authentication Code used to verify the integrity of the session.
The MAC is computed over all data sent by user A up to this point.
Field                DH-1      DH-2      ECDH-1    ECDH-2
EMSG-Type            1         1         1         1
Session ID           4         4         4         4
Challenge Response   4         4         4         4
Payload Length       2         2         2         2
Payload              LP        LP        LP        LP
MAC                  32        20        32        20
Total Length         42 + LP   30 + LP   42 + LP   30 + LP
Table 5.7. EMSG-SDAT message format (field length in bytes)
5.6.2.4 Channel Issues
In this section it is assumed that the messages are carried using SMS messages;
this channel is described in more detail in section 3.1.2.1. The number of SMS
messages, N, required to carry a data message of length L bytes is

    N = 1                    if L ≤ 138
    N = ⌈(L + 2) / 134⌉      if 138 < L < 34168
This assumes that UDH is used for concatenating messages and that two bytes are
used to identify the message as being an EMSG of extended P2P type.
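The formula above can be expressed directly in code; this is a minimal sketch with the thesis's constants (138 payload bytes unfragmented, 134 per concatenated part, 2 bytes of EMSG type identification).

```python
import math

def sms_count(length: int) -> int:
    """Number of SMS messages needed to carry `length` bytes of data,
    mirroring the piecewise formula in the text."""
    if length <= 138:
        return 1
    if length < 34168:
        return math.ceil((length + 2) / 134)
    raise ValueError("message too long for UDH concatenation")
```

For example, a 212-byte EMSG-SREQ in the DH-1 profile gives `sms_count(212) == 2`, consistent with table 5.8.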
The total number of SMS messages needed to transmit one extended EMSG is
displayed in table 5.8; P in this table depends on the payload and may be zero.
All protocols except DH-1 fit into single SMS messages, while DH-1 requires two
messages to transfer each public key.
Message          DH-1    DH-2    ECDH-1   ECDH-2
EMSG-SREQ        2       1       1        1
EMSG-SRES        2       1       1        1
EMSG-SDAT        1 + P   1 + P   1 + P    1 + P
Total Messages   5 + P   3 + P   3 + P    3 + P
Table 5.8. SMS messages required for transmitting an extended EMSG in P2P-mode.
The value of P is dependent on the payload.
Most organizations using the Tiger XS do not allow the unit to be left in
active mode unsupervised. To function efficiently this protocol requires the
receiving Tiger XS unit to be turned on during the message exchange; as this is
not how most Tiger XS units are used, this is a major drawback. It also requires
the Tiger XS to pull messages from the phone without user interaction if the
system is not to be experienced as cumbersome, a capability not currently
implemented. In effect, a P2P-oriented approach demands that the user accept a
process that is less simple than the previously existing messaging. For these
reasons it is recommended that this protocol be used as a fallback system and
that the primary means of communication be the Kerberos-based approach
described in section 5.6.1.
5.6.2.5 Security
The protocols depicted above provide confidentiality using a customer-specific
symmetric cipher with an ephemeral key derived from a preshared secret key and
a Diffie-Hellman key agreement. The IV used with the cipher is derived from
either the public DH parts or a combination of the public DH parts and the
session ID.
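The key derivation described above combines the two secrets, so an attacker must know both the preshared key and the DH shared secret to reconstruct the session key. The thesis does not specify the KDF; the sketch below assumes a single HMAC-SHA-256 extraction step, and all key material shown is illustrative.

```python
import hmac
import hashlib

def derive_session_key(preshared: bytes, dh_shared: bytes, session_id: bytes) -> bytes:
    """Derive the ephemeral session key from the preshared secret key and
    the Diffie-Hellman shared secret (assumed KDF: HMAC-SHA-256 keyed with
    the preshared secret, binding in the session ID)."""
    return hmac.new(preshared, dh_shared + session_id, hashlib.sha256).digest()

k = derive_session_key(b"\x02" * 32, b"\x03" * 96, b"\x00\x00\x00\x01")
assert len(k) == 32
```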
The MAC provided with the message assures that no part of the message has
been tampered with. This mitigates problems with so-called man-in-the-middle
attacks as all communications are signed by the parties involved.
The challenges, as well as to some extent the MACs, provide assurance that
the sender is in possession of the preshared symmetric key, which is equated with
being authorized to communicate.
The protocols provide protection against replay attacks as different public DH
keys can be expected for every new session.
The possibility of mounting a Denial-of-Service (DoS) attack exists. The most
effective strategy would be to send large numbers of EMSG-SREQ messages,
forcing the receiver to issue EMSG-SRES messages in response. Messages of type
EMSG-SRES and EMSG-SDAT with unknown session IDs, and messages of type
EMSG-SDAT with invalid MACs, could be dropped to limit the effect of a DoS
attack.
As for key sizes the following can be said: a 1024-bit key to be used with
exponential key schemes offers about 80 bits of security according to the National
Institute of Standards and Technology (NIST) [16]. Keys smaller than 1024 bits
are not addressed by NIST, but an extrapolation indicates that a 768-bit key
would equate to about 70 bits of security, and an interpolation indicates that a
1536-bit key would provide about 96 bits. In most modern applications, 96 and
70 bits of security are considered insufficient. In this application, however, the
DH-derived key is used in conjunction with a significantly stronger preshared key
and thus might offer sufficient protection. The choice of key size in the protocols
above is, as is apparent, related to how many SMS messages are needed to carry
them; key sizes beyond 1536 bits would need at least three concatenated SMS
messages to carry the EMSG-SRES message. It should be stressed that a 768-bit
exponential DH key is to be considered weak and should only be used if the
ramifications of this are found to be tolerable.
The ECC-based protocols mitigate problems with sending large public keys,
as ECC keys are generally considered to have strength of O(√n). The two key
lengths used in the protocols above, 521 bits and 384 bits, equate to 256 and 192
bits of security respectively according to NIST [16], which should be enough for
any modern application. It should be noted that significant advances in solving
the elliptic curve discrete logarithm problem, the problem from which ECC draws
its strength, could drastically reduce the strength of the system.
5.6.3 Pre-setup Session Keys
The following section does not describe a protocol in detail but rather outlines
ideas from which an exact protocol could be derived.
As the Tiger XS is capable of establishing a data link for encrypted voice data,
and as this channel is deemed to be secure, there exists a possibility of
piggybacking on this channel to transmit and/or derive keys to be used for EMSG
communication.
In addition to an actual key, key validity parameters and a key ID have to be
established. Key validity periods could be agreed upon implicitly, given that the
two units have synchronized time references, for example by setting the validity
period of the keys to the coming two weeks.
As keys would have to be stored for later use, a key identity must be established
in order for both parties to be able to refer to the same key unambiguously. This
could be done in one of two ways: either an ID could be negotiated, whereby
each unit could assert that the ID selected is not occupied by another key, or an
identity could be derived via pseudo-randomization, assuming that the number of
possible identities is sufficiently large for collisions to be overwhelmingly unlikely.
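Whether pseudo-random identities are safe enough follows from the birthday bound. The sketch below evaluates it for 64-bit identities, an assumption chosen to match the eight-byte key IDs used elsewhere in the protocol.

```python
import math

def collision_probability(stored_keys: int, id_bits: int) -> float:
    """Approximate birthday-bound probability that at least two of
    `stored_keys` uniformly random key identities collide:
        p ≈ 1 - exp(-k(k-1) / 2^(b+1))
    for k identities drawn from a b-bit space."""
    k = stored_keys
    return 1.0 - math.exp(-k * (k - 1) / 2 ** (id_bits + 1))

# Even a million stored keys with 64-bit random IDs collide with
# probability below one in ten million.
assert collision_probability(1_000_000, 64) < 1e-7
```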
A dedicated key exchange voice connection could be established for the sole
purpose of exchanging sets of session keys, with the voice communication being
used only to verify the link fingerprints (as an added man-in-the-middle
countermeasure).
As an alternative to dedicated key exchange connections, a normal encrypted
voice call could be used to establish keys. Keys could be transmitted explicitly,
with the downside of having to reserve space for a side-channel alongside the
normal voice channel. Explicitly transmitted keys could be transferred at call
setup, so as not to interfere with voice transmission, at the cost of a slightly
longer call setup time. Alternatively, keys could be transferred during the actual
conversation, preferably during silence when less data is being transmitted, or
after the call has been hung up, with the downside of possibly confusing users as
the line needs to stay connected a few more seconds after the call has ended. A
fourth option is to let the key transmission be user triggered, offering the option
of pressing a button during a call to exchange keys.
Another approach is to derive keys implicitly, perhaps by using a cryptographic
hash function to hash long sections of voice communication data or protocol data
such as public keys. Similarly, a pseudo-random key identity could be derived by
hashing voice or other protocol data. The advantage of implicit keys is of course
that no side-channel is needed for transmitting keys; one of the main disadvantages
is the increased risk of synchronization issues, as a single bit error in the
decrypted plaintext would cause the users to derive different keys.
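Implicit derivation of both key and key identity can be sketched as below. SHA-256 and the domain-separation labels are assumed choices; the thesis only requires a cryptographic hash of shared data. Note how a single differing byte in the hashed data yields completely different results, which is the synchronization risk noted above.

```python
import hashlib

def derive_implicit_key(shared_data: bytes):
    """Derive a session key and a pseudo-random 8-byte key identity from a
    stretch of shared (decrypted) voice or protocol data. Both parties must
    hash byte-identical data."""
    key = hashlib.sha256(b"emsg-key" + shared_data).digest()
    key_id = hashlib.sha256(b"emsg-key-id" + shared_data).digest()[:8]
    return key, key_id

k1, id1 = derive_implicit_key(b"voice frames ...")
k2, id2 = derive_implicit_key(b"voice frames ..!")   # one byte differs
assert k1 != k2 and id1 != id2
```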
Constructing keys requires access to a source deemed to be sufficiently random;
for some applications, particularly military applications, finding such a source
might pose a problem. Only a keyspace of n bits where all 2^n different keys are
equiprobable will have an entropy of n bits, the entropy of the keys being crucial
for the security of the system. Highly compressed voice data constitutes a
high-entropy pseudo-random source and could perhaps, as noted above, be used
when constructing keys.
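The entropy claim can be made concrete with the Shannon entropy, H = -Σ p_i log2 p_i: a uniform n-bit keyspace reaches the full n bits, while any bias wastes key space. The sketch below checks this for a small (assumed, illustrative) 16-bit keyspace.

```python
import math

def entropy_bits(probabilities) -> float:
    """Shannon entropy H = -sum p_i * log2(p_i) of a key distribution."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

n = 16
uniform = [1 / 2**n] * (2**n)                     # all 2^n keys equiprobable
assert abs(entropy_bits(uniform) - n) < 1e-9      # full n bits of entropy
# A biased generator: half the probability mass on a single key.
biased = [0.5] + [0.5 / (2**n - 1)] * (2**n - 1)
assert entropy_bits(biased) < n                   # strictly less than n bits
```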
Keys and their properties would have to be stored on the devices until used;
preferably they would be stored encrypted.
5.6.3.1 Security
Key establishment, whether keys are derived or transmitted, relies on the security
of the voice data channel. A voice data connection could be made over a more
trusted channel prior to a stay in a region where channels are known to be
monitored. An example would be a diplomat making voice data connections over
a domestic telephone network, or perhaps even a null-modem, thereby establishing
keys before traveling to a foreign country where systematic wiretapping is known
to occur.
Provided that session keys and their properties are stored encrypted on the
devices, their secrecy is dependent on the key used for encrypting them. Given
that the device provides means for keeping such keys, perhaps in the form of
the preshared secret key used for other communication, storage could be made
reasonably secure.
The security of the message transmission would be enhanced by using preshared
secret session keys; the level of security enhancement is highly dependent on the
exact implementation.
5.7 Source Coding and Cryptography
In communication systems source coding and cryptography often coexist. There
are two aspects of this coexistence that are important to note.
Cryptanalysis of a cipher is made more difficult if the plaintext is effectively
source coded. Any knowledge of the structure of the plaintext language can provide
assistance when analyzing the ciphertext. The extent of this aid can be exemplified
by considering the ciphering of an ASCII text in which every eighth bit is always
set to zero. Knowledge that the plaintext is composed of ASCII text effectively
enables a partially known plaintext attack, a problem which could be mitigated
by source coding the text.
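The ASCII observation is easy to verify: every ASCII code point is below 0x80, so the most significant bit of every plaintext byte is known in advance. In the sketch below, zlib stands in (as an assumed, illustrative substitute) for the source coders discussed in chapter 4.

```python
import zlib

text = "Attack at dawn".encode("ascii") * 20
# Every ASCII byte is below 0x80: one known plaintext bit per byte,
# free structure for a cryptanalyst.
assert all(byte < 0x80 for byte in text)

# Source coding the text first removes much of this redundancy: the
# repetitive 280-byte input shrinks substantially before encryption.
compressed = zlib.compress(text)
assert len(compressed) < len(text)
```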
Source coding must be carried out prior to the encryption. The output of
a cipher is composed of a sequence of uncorrelated bits, and the lack of
correlation makes source coding ineffective. Ciphertext that can be effectively
compressed using source coding is a clear sign of a flawed cipher design and
cause for serious concern.
5.8 Steganography Using PPM
During the work on the source coding methods presented in this thesis it was noted
that the PPM source coding algorithm could potentially be used as a method to
enhance security by concealing the fact that the messages carry encrypted data.
Common encryption systems (including the encrypted short message system
present in the Tiger XS) use a cryptographic function to transform text into what
appears to be a random stream of binary digits.
As the encrypted data is sent using SMS messages, that is, over a text channel,
the presence of non-textual data is certain to attract attention. As such attention
is rarely wanted, indeed often unwanted, it might be desirable to form the SMS
messages so that they appear to be regular text messages. This art of hiding data
in other data is generally referred to as steganography. The process is illustrated
in figure 5.6.
PPM (see section 4.6.5.6) constitutes a function mapping every possible
combination of characters onto a sequence of bits whose length depends on the
probability of that specific combination of characters. The strength of PPM when
used, as normally, for source coding lies in the fact that probable combinations
of characters map onto shorter sequences of bits, whereby compression of data
can be achieved.
When using PPM for source coding, a sequence of characters is compressed into
a sequence of bits which is sent over a channel. When using PPM for steganography,
a sequence of bits is expanded into a sequence of characters which is then sent
over a channel. Therefore PPM steganographic encoding is performed like PPM
source decoding, and PPM steganographic decoding is performed like PPM source
encoding.
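The role reversal above can be illustrated with a toy model. The thesis uses PPM with arithmetic coding; the sketch below substitutes a static order-0 Huffman code (a much weaker model, named plainly as a stand-in) purely to show that stego-encoding is source decoding and vice versa.

```python
import heapq
from collections import Counter

def huffman_code(reference: str) -> dict:
    """Prefix code from character frequencies in a reference text
    (a toy stand-in for the PPM statistics)."""
    heap = [(freq, [ch]) for ch, freq in sorted(Counter(reference).items())]
    heapq.heapify(heap)
    codes = {chars[0]: "" for _, chars in heap}
    while len(heap) > 1:
        f1, chars1 = heapq.heappop(heap)
        f2, chars2 = heapq.heappop(heap)
        for ch in chars1:
            codes[ch] = "0" + codes[ch]
        for ch in chars2:
            codes[ch] = "1" + codes[ch]
        heapq.heappush(heap, (f1 + f2, chars1 + chars2))
    return codes

def stego_encode(bits: str, codes: dict) -> str:
    """Steganographic ENcoding = source DEcoding: expand payload bits into
    characters by matching codewords. A trailing partial codeword is padded
    with zeros (a real protocol carries the payload length, section 5.8.1)."""
    decode = {v: k for k, v in codes.items()}
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in decode:
            out.append(decode[buf])
            buf = ""
    while buf:                       # pad the last, incomplete codeword
        buf += "0"
        if buf in decode:
            out.append(decode[buf])
            buf = ""
    return "".join(out)

def stego_decode(stegotext: str, codes: dict, nbits: int) -> str:
    """Steganographic DEcoding = source ENcoding: re-encode the characters
    and drop the padding bits."""
    return "".join(codes[ch] for ch in stegotext)[:nbits]

reference = "the quick brown fox jumps over the lazy dog " * 10
codes = huffman_code(reference)
payload = "1011001110001111010"
stegotext = stego_encode(payload, codes)
assert stego_decode(stegotext, codes, len(payload)) == payload
```

Because frequent characters have short codewords, random payload bits tend to decode into frequent characters, which is exactly the effect the section exploits.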
When feeding a PPM decoder with random bits, it is likely that the bits are
decoded into combinations of characters that occur frequently in the language
reference file used to set up the PPM statistics. It is here postulated that for
sufficiently long sequences of bits, a large proportion of the character combinations
will appear as in a regular text similar to the language reference file.
The practicality of a PPM-based steganography was evaluated using tools built
into the source coding evaluation environment (see section 4.3). Payload data was
simulated using the output of a Linear Feedback Shift Register (LFSR).
In addition to the texts presented in the sections below, four slightly longer
output texts are presented in appendix D.
5.8.1 The Protocol
When using PPM for source coding, the size of the data to be transferred would
be expected to drop, perhaps significantly. This is due to the simple fact that the
number of “likely” sequences of characters in a text of length n is far smaller than
the number of possible combinations of n bytes.
This principle holds true when PPM is used for steganography as well, and
results in an invertible expansion of the data to be transmitted. This is the price
of steganography. A trade-off exists here: the more certain one wishes to be that
the stegotext resembles the reference text (and therefore also resembles normal
text), the more expanded the data is going to be.
The use of arithmetic coding makes the situation slightly more complicated,
as the steganographic decoding may generate a few more bits than originally
intended. To mitigate this, a protocol field indicating the payload size would have
to be prepended to the payload. This field could very well be coded by a
variable-size code such as the Golomb code described in section 4.7.1.1.
An inherent problem of using this model of steganography as a generic protocol
is the risk of the payload data containing long runs of zeros or ones, which will
typically create messages with initial combinations of letters unlikely to occur in
normal text. In the implementation used for this thesis such a message will start
with “AAA. . . ”. This problem could be mitigated by scrambling the data with
an LFSR before encoding it. The effect of the scrambler is reversed on the
receiving side by once more letting the scrambler process the data. The use of a
scrambler is illustrated in figure 5.7.
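An additive LFSR scrambler of the kind suggested above can be sketched as follows. XORing with the same keystream twice restores the data, so one routine serves both sides. The 16-bit register with taps 16, 14, 13, 11 (a maximal-length polynomial) and the seed 0xACE1 are assumed, illustrative choices; the thesis does not fix them.

```python
def scramble(data: bytes, seed: int = 0xACE1) -> bytes:
    """Additive scrambler: XOR data with the keystream of a 16-bit
    Fibonacci LFSR. Self-inverse: scramble(scramble(x)) == x, so the
    receiver descrambles by running the same function again."""
    state = seed & 0xFFFF
    out = bytearray()
    for byte in data:
        keystream = 0
        for _ in range(8):
            # Feedback for taps 16,14,13,11 (bit positions 0,2,3,5).
            fb = ((state >> 0) ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
            state = (state >> 1) | (fb << 15)
            keystream = (keystream << 1) | fb
        out.append(byte ^ keystream)
    return bytes(out)

payload = b"\x00\x00\x00\xff\xff"        # long runs that would yield "AAA..."
scrambled = scramble(payload)
assert scramble(scrambled) == payload    # involution: descrambling restores data
assert scrambled != payload              # the runs are broken up
```

In the protocol, the seed would be a fixed constant known to both parties, since the scrambler serves whitening rather than secrecy.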
The function mapping a sequence of bits onto a stegotext is bijective. Since a
given sequence of bits will map onto one specific stegotext only, there exists a
risk that this stegotext has a large proportion of unlikely character combinations
and therefore does not offer a particularly good cover for the data. It might
therefore be desirable to instead use a surjective version of the function. This
could be achieved by prepending a small sequence of bits to
Figure 5.6. Two ways of transferring the message: (a) standard transmission; (b) steganographic transmission.
Figure 5.7. Scrambler function
Figure 5.8. Steganographic packet
the payload sequence. When performing the steganographic decoding these bits
should be removed.
The existence of a small set of bits that can be freely varied by the
steganographic encoder ensures that several sequences of characters can be
considered and that the one assumed to be the best, by some criterion, can be used.
Such a small set of bits, hereafter referred to as a seed, could be constructed
using a fixed, predetermined set of bits, generating a fixed, finite number of
possible seeds. It could also be constructed using a variable-length code such as
the Golomb code (see section 4.7.1.1). In particular, modifying a standard Golomb
code to encode the r value first, followed by the q value, would provide a large
variance in the initial bits while also providing an infinite number of possible
seed values.
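The modified Golomb seed code can be sketched as below. For simplicity the sketch assumes the Golomb-Rice special case m = 2^k (the general Golomb code works the same way): the k remainder bits come first and vary freely between consecutive seed values, while the unary quotient that follows allows arbitrarily many seeds.

```python
def golomb_r_first_encode(n: int, k: int = 3) -> str:
    """Golomb-Rice code with the remainder r emitted BEFORE the unary
    quotient q, as suggested above. Unary: q ones terminated by a zero."""
    q, r = divmod(n, 1 << k)
    return format(r, f"0{k}b") + "1" * q + "0"

def golomb_r_first_decode(bits: str, k: int = 3):
    """Return (value, number of bits consumed)."""
    r = int(bits[:k], 2)
    q, i = 0, k
    while bits[i] == "1":
        q += 1
        i += 1
    return q * (1 << k) + r, i + 1

# Seeds 0..7 already differ in their first three bits; larger seeds just
# grow the unary tail, so the seed space is unbounded.
assert golomb_r_first_encode(0) == "0000"
assert golomb_r_first_encode(9) == "00110"
assert golomb_r_first_decode("00110") == (9, 5)
```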
Combining the methods mentioned above results in a packet structure as depicted in figure 5.8.
5.8.2 Adapting PPM for Steganography
PPM is an algorithm adapted for source coding. Before using it for steganography,
a small number of changes would preferably be made in order to achieve better
results.
An important principle in PPM coding is that no matter what characters appear
in a stream, the encoder must be able to encode them. This is achieved by
asserting that all characters have an estimated probability greater than zero, so
that they can be encoded using a finite set of bits. In practice this is ensured
by allowing predictions from the context order referred to as order -1, to which
one can always escape and where all characters are predicted, with equal
probability.
There is no need to be able to code all characters when using PPM for
steganography; in fact it is desirable if some characters do not appear in the
stegotext. In particular, all non-printable characters are unwanted. The use of
characters which are not expected to appear in the text at all can be avoided by
not making any predictions from order -1 (in practice, by setting the escape
symbol probability to zero in order 0).
To avoid patterns that are unlikely to appear in text, one could further reduce
the scope of the model by not predicting from order 0, making all predictions
context-dependent. This could be done by setting the escape symbol probability
to zero in order 1.
To further make the output appear closer to the source, the escape symbol
probabilities could be scaled in order to make an escape to a lower-order context
less likely.
All the methods described above make the stegotext look more inconspicuous,
but they come at a price: they are likely to result in a bigger expansion of the
text. To illustrate this, four texts were generated using the same 200 bytes of
payload (a 2-bit seed, seed = 11, was used in all cases). The PPM settings, as
well as their respective expansion factors, ef, are presented in the list below. The
first 300 characters of the texts themselves are also included in figures 5.10
through 5.13. The language reference file CaP (see appendix A) was used.
StandardPPM (ef = 3.71) Predicting from all lower contexts. No scaling, no
capital conversion, maximum context set to 12.
NoOrder(-1) (ef = 3.83) Predicting from context order 0 and up. No scaling,
no capital conversion, maximum context set to 12.
NoOrder(0) (ef = 3.80) Predicting from context order 1 and up. No scaling, no
capital conversion, maximum context set to 12.
EscapeScaling (ef = 5.72) Predicting from context order 1 and up. Escape
symbol probabilities scaled by 1/5, no capital conversion, maximum context
set to 12.
To further investigate the effect of the settings varied above, 256 different
payloads were tested with each of the profiles above. The average, minimum
and maximum expansion factors for each profile are presented in table 5.9. The
histogram for the EscapeScaling profile is shown in figure 5.9.
As is evident from the empirical results provided here, not predicting from order 0
and below costs very little in terms of expansion, while providing a result that
appears more like the source text.
5.8.3 Detecting PPM Steganography
As is apparent from reading the stegotexts provided in this thesis, any person
would easily see that the message is not a normal, readable text. The challenge
in this thesis is thus focused on providing a stegotext sufficiently close to a
normal text to avoid detection by automated systems. As automated systems could be
Figure 5.9. Histogram over the expansion factor, ef , in the EscapeScaling message set
Profile         Minimum   Average   Maximum
StandardPPM     3.43      3.88      4.36
NoOrder(-1)     3.48      3.89      4.30
NoOrder(0)      3.51      3.93      4.35
EscapeScaling   5.30      5.77      6.30
Table 5.9. Expansion factors for steganographic PPM
our longing!
ExtedRaskolnike a long Vated immovableciallycigarething, too aback,[*]ence.
It’s special cing?’ and trial. Now forsomeone mutteredge speakers.
I will wait a level never geoff, how load! Now in no atorst! Or
thisbrutat food,w shops, iciout I’very Petrovitch’s thungeroom,
whethertimov isant went off into have gospel,but thuld depends
onthe police office.’ Tchebath. What asombre and waspedhim with
Sonia knew thick feelings everal of phole morning. These wordsgivingsking fourteen mean: ’Anwhim; on that yesterder on Sonia’t
makes,ppearanciedly, am I am? That’s soakck forgivenessives eve I
to gain! But he... mouthe first time wer. But I havesualready her
face (Ivand he accused. And; sock-sparriage, who widently on purpose!
Figure 5.10. StandardPPM, ef = 3.71
our me tellyou? Only isked irrid ofLeave off, todo go out one fools.
Youbles own pillow.
Whave forgotten?Bu asides in thelaw, apawn though it regedto say
something else or whiskers... I’ve nappy!Pyotr Petrovitch and onmy
sisterary men, normality....
The crowd of the very apokely, always two literary out? cried in dismailov? You try tugglad)Ik you, squeezy.’ Dounia crowdeduct But at
last break, suppliar yourfiry crewed.’
I were lf.
Nastasya soft side oflov.
Leave roublesame it? Luzhing the fourteouslighteems your honourable
silvaniseyes for you in Andreys afraid yoursus said Razumihin oners
weretheir of right orphasister–she revolver.I drag a convulsively aft,
Ing a tearstanderstandishest usgrace, he hadneveness. You know Missy,
Raskolnia.
Razut;ited
Figure 5.11. NoOrder(-1), ef = 3.83
our moreover, quietly, it’ I did! He One for us trace another.Thirty
copecks stayed andeve it again:t even able things. We willgarity idea,
but coulds. Ond alittle busing again.
Veant to water; he got ill, strov, thoughtRaskush! Therehad tural
haughtime.
Ovitch, threst.
Pyotr Petrovitch.He up unto Histakestroyer,swhathe was talkingatm
the chemissy, as they put on thinking,but, for I jewel-carcelya Petrovitch.He passage, noise,and looking night, breathing coulevards.. Wements; no, though he fleepossessince of Reasy.Tell us to kered in hurried, angry asdangerous and the murder of joy; later Katerina Ivar wipe
first thing t of by nine hungas live of mustn in hisopitall speak paintly
quitement. Wried lying on Por drink! Tarisedly te she lived works.
Figure 5.12. NoOrder(0), ef = 3.80
our inted, in earnest. You knowing when hehundred paces, stairs....
Oh, no, near ago he axe in the noose. They are children, or himself,
and induced shore him. What do you means for yourled faintly. He was
in a state of fighting, quite unfortunate! But I’ll condit a discoldered,
sat up in fun; course?
Ah a r skinny brother, I don’t know what I am so sight of hurts me,
genius. And if I explained on thefloor is over! Finn, leave me, Crime,
for you know allmy bearing thecond passage to loose findsince Ihaven’t
been taken intently.... Never mind as our by it all, mercy,’ she said,
good-bye for the present my engaged in a rapture, to-mind. Upon me!
Don’t do you want with me!
I she had been his oblements, thetwo arm glad, he let out in the first
place side of the wall about it’s our AlyonaIvanovna, he said softly,
wept and they could, as itwere, youwill provided, withyou.
That’s as totake left, he shouted to him, but stooping and spluttering with fury another myself....Youare to accept duty of effort, Pole,
answered Svidropposite side ofthis to yearsI am not justified hisowing
this afaintance of late a shine and furiously at the
Figure 5.13. EscapeScaling, ef = 5.72
constructed in an infinite number of ways, the steganographic method's ability to
construct messages that avoid detection will not be measured; rather, a short
discourse on the subject is provided.
The most intuitive, and perhaps the most practical, method of scanning messages
to spot non-textual messages is of course to use a dictionary, comparing the words
in the message to words in the dictionary. As this approach is likely to feature
a threshold, filtering out messages where less than a certain percentage of the
words are identified as proper words, this scheme is easily defeated by ensuring
that the stegotext contains enough proper words, which, judging from the
stegotexts supplied in appendix D, is possible.
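A threshold filter of the kind described can be sketched in a few lines; the word list below is a tiny illustrative stand-in for a real dictionary, and the threshold value would be a tuning parameter of the hypothetical scanner.

```python
def word_fraction(message: str, dictionary: set) -> float:
    """Fraction of whitespace-separated tokens found in a word list,
    after stripping surrounding punctuation."""
    words = [w.strip(".,!?;:'\"()").lower() for w in message.split()]
    words = [w for w in words if w]
    if not words:
        return 0.0
    return sum(w in dictionary for w in words) / len(words)

# Toy word list; a real scanner would use a full dictionary.
dictionary = {"our", "longing", "it", "is", "a", "long", "wait", "i",
              "will", "never", "now", "in", "no", "or", "this", "the"}

# A stegotext with enough proper words slips under a 50% threshold filter,
# while pure binary-looking noise does not.
assert word_fraction("I will wait a long time", dictionary) > 0.5
assert word_fraction("xq zvtp qqq", dictionary) == 0.0
```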
A more advanced approach may feature elements assessing the grammar. But
as grammar in text messages typically varies in quality, the threshold would have
to be set generously. Stegotexts generated from PPM models with higher context
orders will have grammatical features to some extent, as a context often consists
of several words.
It is possible that the texts could fool human review, provided that the reviewing
person has no, or very poor, knowledge of the language in question. No
experimental data on this is made available here, though.
The ability to determine what is weird is essential if seeding is to be used.
Seeding requires some method of determining the relative perceived weirdness of
the texts and selecting one for transmission, preferably the least weird and/or
shortest text. Determining the nature of perceived weirdness in texts is beyond
the scope of this thesis and no theories on this are provided here.
5.8.4 Cryptographic Aspects
It is important to note that the cryptographic security of a message sent using the
method of steganography presented here rests on the encryption system itself, as
the message is encrypted prior to applying steganographic methods.
Nevertheless, it is tenable to assume that the use of PPM steganography as
presented here would significantly complicate cryptanalysis of the message. It
is reasonable to assume that the payload cannot be recovered from the stegotext
without knowledge of the statistical model used by the PPM predictor. In this
sense the statistics employed by the PPM predictor serve as a cryptographic key.
5.8.5 Practical Test
A practical test of the function of a prototype system was carried out in the
following manner. Texts were source coded and encrypted using 128-bit AES
encryption. The resulting data was then fed through a steganographic encoder.
The process is illustrated in figure 5.14. Two sets of plaintexts and the stegotexts
resulting from this operation can be found in appendix D.
Figure 5.14. Encoding process
Chapter 6
Conclusions, Recommendations and Further Studies
This chapter presents conclusions, makes recommendations and suggests further
studies. It will do so by
- Summarizing the findings of the previous chapters and drawing conclusions from them. (section 6.1)
- Recommending the best methods and algorithms to implement. (section 6.2)
- Indicating areas addressed in this thesis that would benefit from further studies. (section 6.3)
6.1 Conclusions
This section will discuss the findings of the previous chapters.
6.1.1 Communication Protocols
We conclude that the message exchange would benefit substantially from the use
of UDH headers. The protocol described in section 3.2 offers a solution for
coding mixed-content messages.
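As an illustration, the UDH element used to concatenate SMS messages is a small, fixed-layout header; the sketch below builds the 6-octet header (information element 0x00, 8-bit reference variant) for one part of a concatenated message. The helper name is our own:

```python
def concat_udh(ref: int, total: int, seq: int) -> bytes:
    """Build the 6-octet User Data Header for one part of a concatenated SMS.

    Layout (8-bit reference variant): UDH length 0x05, IEI 0x00 (concatenation),
    IE length 0x03, then message reference, total parts and this part's number.
    """
    if not (1 <= seq <= total <= 255):
        raise ValueError("part number must lie within the total part count")
    return bytes([0x05, 0x00, 0x03, ref & 0xFF, total, seq])

# Header for part 2 of 3 of message reference 0x2A:
header = concat_udh(0x2A, 3, 2)
assert header.hex() == "0500032a0302"
```

A receiver groups parts by the reference octet and reassembles them in part-number order.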
6.1.2 Source Coding
Several fairly effective source coding methods were derived. It was shown that
even though a source message is too small to offer reliable statistics for its
own coding, a source deemed to be “similar” can be used instead, with performance
still being acceptable.
6.1.2.1 Dictionary Coding
Dictionary coding is an intuitive method, and it is relatively easy to understand
the consequences of design choices such as dictionary size. To see which method
may be the most beneficial, comparative graphs presenting the performance of the
various methods are included in appendix C.
For fixed length codes the Q-gram method is the obvious choice. It outperforms
all other methods regardless of dictionary size and regardless of whether the
texts deviate slightly from the model (i.e. when reference messages in English
are included as well).
For variable length codes the performance is more even. We start by noting that
the performance of the Unigram method is fairly good, with expected coding rates
of 4.5 – 5.0 bits/character. The Wordbook method offers the best performance at
low and medium dictionary sizes while, quite surprisingly, retaining the best
performance even if English messages are included. For large dictionary sizes
the Q-gram method proves the virtue of including character combinations other
than words in the dictionary; perhaps there simply are not enough commonly used
words in Swedish to fill a dictionary with over 4 000 entries.
Implementing a larger fixed length dictionary coder perhaps offers the best
complexity/performance trade-off: only the actual q-grams have to be stored,
while their respective bit patterns can be derived from their indices. Encoding
is basically a simple parsing problem, and decoding is simply implemented as a
lookup table. This approach will give an expected coding rate of around 4.5
bits/character.
If instead a memory/performance trade-off is desired, a smaller variable length
Wordbook dictionary offers performance of about 4.0 bits/character. A simple
Unigram dictionary would offer performance of about 4.5 – 5.0 bits/character at
a completely negligible memory cost and with fast throughput. A Unigram coder
also has other benefits, such as not being affected by spelling errors, since a
character's coding is not context-dependent.
6.1.2.2 PPM
PPM offers performance unrivaled by dictionary coding. Not only is its coding
rate low on data similar to the reference model; it also retains a relatively
low coding rate on data in other languages.
There are a large number of parameters in PPM-compression that can be
varied, and many of them have been tested. The following can be said about
the choice of parameters:
Exclusion Exclusion should be used. The computational benefits of lazy exclusion
are marginal, and it degrades compression significantly. Experimenting with
update exclusion would be advised, as the actual update is only done once, when
building the tree.
Coding Arithmetic coding is superior, given that the statistics it uses are
accurate. Huffman codes must be rebuilt for every symbol (including escape
symbols!) when used by PPM; this poses a hefty computational cost on top of the
loss in compression performance.
Maximum Context Order The tests carried out indicate that using contexts longer
than eight characters gives no substantial gain in compression performance.
Memory requirements are lowered significantly when shorter contexts are used.
Deterministic Contexts Scaling predictions of deterministic contexts, quite
surprisingly, turned out to degrade performance, perhaps because the data is not
as homogeneous as when using adaptive compression on larger datasets.
Escape Methods Method C turned out to be the best method, not too unexpectedly.
Method C is also simple to implement, making it an obvious choice.
Occurrence Thresholds A high occurrence threshold is a very effective way to
reduce memory requirements significantly at a fairly low price in performance.
Which threshold is “best” depends on the size of the language reference file as
well as the amount of available memory.
Probability Scaling Scaling the escape probability turned out to be ineffective,
regardless of whether one scaled up or down, a testament to the efficiency of
the predictor.
Capital Conversion Capital conversion is a relatively easy feature to implement
compared to PPM as a whole. It offers a noticeable performance boost and
should be implemented if PPM is used.
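Of the parameters above, escape method C is compact enough to state exactly: with n symbol occurrences and t distinct symbols seen in a context, the escape probability is t/(n + t), and a symbol seen c times gets probability c/(n + t). A minimal sketch (the function name is ours):

```python
from collections import Counter

def method_c(counts: Counter):
    """Return (symbol probabilities, escape probability) under PPM escape method C."""
    n = sum(counts.values())  # total symbol occurrences in this context
    t = len(counts)           # number of distinct symbols seen
    escape = t / (n + t)
    probs = {sym: c / (n + t) for sym, c in counts.items()}
    return probs, escape

probs, escape = method_c(Counter("mississippi"))
# n = 11 occurrences, t = 4 distinct symbols, so the escape probability is 4/15
assert abs(escape - 4 / 15) < 1e-12
assert abs(sum(probs.values()) + escape - 1.0) < 1e-12
```

The symbol and escape probabilities sum to one, so the distribution can be fed directly to an arithmetic coder.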
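Capital conversion itself can be illustrated with a simple reversible transform: lowercase the text and mark each originally uppercase letter with a reserved flag character, so that the model only ever sees lowercase letters. The marker convention below is our own, not necessarily the one used in the measurements:

```python
CAP = "\x01"  # reserved flag inserted before an originally uppercase letter

def cap_encode(text: str) -> str:
    """Lowercase the text, flagging each uppercase letter with CAP."""
    return "".join(CAP + c.lower() if c.isupper() else c for c in text)

def cap_decode(text: str) -> str:
    """Invert cap_encode by re-capitalizing flagged letters."""
    out, upper_next = [], False
    for c in text:
        if c == CAP:
            upper_next = True
        else:
            out.append(c.upper() if upper_next else c)
            upper_next = False
    return "".join(out)

assert cap_encode("Hej Du") == "\x01hej \x01du"
assert cap_decode(cap_encode("Hej Du")) == "Hej Du"
```

The model then needs statistics only for the lowercase alphabet plus the flag character, which is where the performance boost comes from.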
Implementing PPM-coding is not as difficult as it may seem. Implementing a
predictor is relatively simple, given that the necessary data structures, for
example a forward tree, are in place. Prediction is identical when decoding and
encoding, which simplifies implementation. Implementing a forward tree structure
and the means by which to populate it with data is not a particularly complex
accomplishment. Implementing an integer-based arithmetic coder is a fairly
challenging task, but ready-made versions may very well be available.
Memory requirements vary, but the tests carried out indicate that by reducing
the maximum context or by raising the occurrence threshold one could reduce the
number of nodes to about 20 000 while still maintaining fairly good compression
ratios. This is approximately equivalent to storing 128 kilobytes of
non-writable data. The number is higher than the 64 kilobytes specified as the
highest reasonable memory requirement, but it will still fit within the hardware
limits. Implementing PPM enables state-of-the-art compression at a reasonably
low cost.
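As a back-of-the-envelope check (the per-node figure is our inference, not a number reported by the measurements), 20 000 nodes in 128 kilobytes implies roughly 6.5 bytes of packed data per node:

```python
# Implied storage per node if 20 000 statistics nodes occupy 128 kilobytes:
nodes = 20_000
budget = 128 * 1024            # bytes
per_node = budget / nodes      # ≈ 6.55 bytes per node
assert 6 < per_node < 7
```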
6.1.3 Security
We note that the security of the message transmission system could benefit from
the use of session keys. Centrally assigned keys offer more rapid communication,
as a key can be issued by a KDC instantly.
The P2P approach does not rely on a third party and is therefore less sensitive
to service outages. It does, however, require the recipient to have a connected
Tiger XS. For the process not to be cumbersome, it also requires the ability to
pull messages from the GSM phone automatically.
Pre-setup session keys avoid extra message exchange and may be the simplest
method of achieving the purpose.
The PPM-based technique for steganography offers an efficient way of masking
encrypted messages. By allowing the messages to expand, a result that resembles
text can be achieved, at an increase of the data rate by a factor of 6.¹ The
factor can be varied depending on what is considered necessary. Implementing
this type of steganography is a very small step if PPM-coding is already
implemented.
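The expansion figures reported in appendix D are consistent with the total factor being the steganographic expansion ef scaled by the source coding rate over the 8 bits of an uncoded character. Assuming a source coder running at about 3.0 bits/character (our assumption), the DN-1 example's ef of 7.87 reproduces the reported total of 2.95:

```python
def total_expansion(ef: float, coded_bits_per_char: float) -> float:
    # The payload first shrinks to coded_bits_per_char/8 of its size,
    # then the steganographic encoder expands it by the factor ef.
    return ef * coded_bits_per_char / 8

# DN-1 in appendix D: ef = 7.87, total factor reported as 2.95
assert round(total_expansion(7.87, 3.0), 2) == 2.95
```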
6.2 Recommendations
This section presents recommendations as to what would be beneficial to
implement.
6.2.1 Communication Protocols
It is strongly recommended that UDH headers be used to span message content over
several SMS messages when needed. It is recommended that elements of the
protocol described in section 3.2 be considered if coding of mixed-content
messages is desired.
6.2.2 Source Coding
Recommendations on how to implement a source coding system in the Tiger XS are
made in the form of three levels. The first level offers a decent source coder
at a minimal price in complexity and resource requirements. The second offers a
better coder at the price of higher resource requirements. The third recommends
the best source coding method evaluated in this thesis, with less regard to
complexity and resource requirements.
6.2.2.1 Level 1: Variable Length Unigram Coder
As the simple implementation, a variable-length code for single characters is
recommended. It requires implementing a Huffman encoder/decoder and a simple
codeword table for the language to be used. As the tables are limited to 256
entries, tables for multiple languages may be included. A data rate of 4.5 – 5.0
bits/character is expected, and the coding scheme would be quite resilient to
changes in the language as long as the two languages use the same alphabet.
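A sketch of the table construction the Level 1 coder needs, assuming an alphabet of at least two symbols; the training string here is illustrative, not a real language reference:

```python
import heapq
from collections import Counter

def huffman_code(freqs: Counter) -> dict:
    """Build a Huffman codeword table from symbol counts (>= 2 symbols assumed)."""
    # Each heap entry: (subtree count, tie-breaker, {symbol: codeword-so-far}).
    heap = [(n, i, {sym: ""}) for i, (sym, n) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tick = len(heap)
    while len(heap) > 1:
        n1, _, t1 = heapq.heappop(heap)   # two least frequent subtrees
        n2, _, t2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in t1.items()}
        merged.update({s: "1" + w for s, w in t2.items()})
        heapq.heappush(heap, (n1 + n2, tick, merged))
        tick += 1
    return heap[0][2]

table = huffman_code(Counter("abracadabra"))
assert len(table["a"]) <= len(table["c"])          # frequent symbol, short code
codes = sorted(table.values())
assert all(not codes[i + 1].startswith(codes[i]) for i in range(len(codes) - 1))
```

With per-character frequencies taken from a language reference file, the expected codeword length of such a table is what yields the 4.5 – 5.0 bits/character figure.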
¹ This factor gets significantly smaller if initial text source coding is employed.
6.2.2.2 Level 2: Large, Fixed Length Q-gram Dictionary Coder
A fixed length code is among the simplest to implement: it does not require a
variable length encoder/decoder and is perhaps simpler to implement than the
Level 1 recommendation.
To be precise, a “large dictionary” refers to a dictionary with 4 096 different
q-grams and corresponding 12-bit codes. Fixing the maximum length of a q-gram to
eight characters would enable storing the dictionary in a simple array of
8-character sequences fitting within 32 kilobytes (8 ∗ 4096 = 32768) of memory
(non-writable will do).
From this one could expect compression performance of about 3.9 bits/character.
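The encoder reduces to greedy longest-match parsing against the dictionary and the decoder to a table lookup; a sketch with a toy dictionary (the entries are invented; a real coder would hold 4 096 q-grams mapped to 12-bit codes):

```python
# Toy q-gram dictionary; single characters are included so parsing never fails.
QGRAMS = ["hej ", "och ", "det ", " "] + [chr(c) for c in range(ord("a"), ord("z") + 1)]
CODE = {g: i for i, g in enumerate(QGRAMS)}      # q-gram -> fixed-length index
MAXLEN = max(len(g) for g in QGRAMS)

def encode(text: str) -> list[int]:
    """Greedy longest-match parsing: always take the longest q-gram that fits."""
    out, i = [], 0
    while i < len(text):
        for l in range(min(MAXLEN, len(text) - i), 0, -1):
            if text[i:i + l] in CODE:
                out.append(CODE[text[i:i + l]])
                i += l
                break
        else:
            raise ValueError(f"no q-gram covers {text[i]!r}")
    return out

def decode(codes: list[int]) -> str:
    return "".join(QGRAMS[c] for c in codes)     # decoding is a table lookup

msg = "hej det och hej"
assert decode(encode(msg)) == msg
```

In the real coder each index would be emitted as a fixed 12-bit pattern, so the cost is 12 bits per matched q-gram.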
6.2.2.3 Level 3: PPM-Coder
A PPM-coder is the recommendation in the high-performance category. It requires
implementation of an arithmetic coder, a PPM-predictor and the inclusion of a
searchable statistics structure.
It is recommended that a maximum context order of 5–8 is used and that the
occurrence threshold is set high enough for the statistics data to fit in
memory. At about 128 kilobytes of memory one could expect a performance of
about 3.1 bits/character, and at the cost of 256 kilobytes about 3.0
bits/character.
6.2.3 Security
It is recommended that the three different session key establishment methods
presented in this thesis be considered if such an extension is desired.
It is noted that though the steganography protocol may not be in demand, it
would probably result in some interesting moments when demonstrating the product
to customers.
6.3 Further Studies
Improved parsing methods may be of interest. In particular, the effect of using
optimal parsing would be interesting to investigate.
Advanced PPM enhancements that significantly benefit the application may be
available, particularly algorithms for storing statistics efficiently.
The steganography approach presented in this thesis may benefit from further
studies, as it appears to be a completely new and slightly odd concept. A study
of the correlation between PPM coding cost and perceived weirdness is
particularly needed, and an efficient analysis method for determining the
relative weirdness of a generated text would be useful.
A more thorough security analysis of the security methods presented in this
thesis is of course needed if they are to be realized.
Appendix A
Language Reference Files
The object of the language reference files is to train the source coder so that
it becomes efficient at compressing data close in constitution to the language
reference file itself. All files are comprised of text, although that is by no
means a necessity.
Six different language reference files were produced; they are all described
below. All files are plain text files encoded using ISO 8859-1 (Latin alphabet,
Western European; compatible with windows-1252).
A.1 DN-4
DN-4 is the fourth revision of a text file comprised of articles from the Swedish
newspaper Dagens Nyheter. All articles have been processed so as to remove
characters, excessive line breaks and other structures that are unlikely to occur in
text messages.
The file is comprised of 102 721 characters. The text is in Swedish.
A.2 CaP
Crime and Punishment by Fyodor Dostoyevsky. The format has been altered,
removing formatting and excessive line breaks.
The file is comprised of 1 154 997 characters. The text is in English.
A.3 RodaRummet
Röda Rummet by August Strindberg. The format has been altered, removing
formatting and excessive line breaks.
The file is comprised of 529 333 characters. The text is in Swedish.
A.4 Bibeln
The 1917 Swedish revision of the Bible. The format has been altered to resemble
a compact text.
The file is comprised of 4 190 614 characters. The text is in Swedish.
A.5 Nils
Nils Holgerssons underbara resa genom Sverige by Selma Lagerlöf. The format has
been altered, removing formatting and excessive line breaks.
The file is comprised of 1 095 314 characters. The text is in Swedish.
A.6 Macbeth
Macbeth by William Shakespeare. The format has been altered, removing formatting and excessive line breaks.
The file is comprised of 103 304 characters. The text is in English.
Appendix B
Performance Reference Files
Performance reference files are the files against which the performance of the
different compression schemes is measured.
B.1 Jordbävning
This text is a short article taken from the Swedish newspaper Dagens Nyheter. It
is written in Swedish and describes an earthquake in Iran.
The text is comprised of 480 characters.
The Text:
En jordbävning uppmätt till 5,9 på Richterskalan skakade i dag södra Iran, rapporterar den statliga radion. Skalvet var kraftigast kring
staden Arzuieh 75 mil sydost om Teheran. Det finns ännu inga uppgifter
om döda eller skadade. Däremot säger lokala myndigheter att omkring
1.400 bostadshus helt eller delvis har förstörts. Jordbävningen beskrivs
som måttlig, men skalv på denna nivå har tidigare dödat tusentals
människor på den iranska landsbygden, där många hus är instabila.
B.2 Nelson
This text is a short text in English.
The text is comprised of 41 characters.
The Text:
England expects every man to do his duty.
B.3 Diplomatic-1
This is a text classified by Swedish authorities; no details except its length
and the fact that it is written in Swedish are given here.
The text is comprised of 1 315 characters.
B.4 Jalalabad
This is a short section of an imitative text describing reported events. The
text is in Swedish.
The text is comprised of 521 characters.
The Text:
Två afghanska civila vittnen uppger sig ha sett vad som förmodas vara
en vapentransport utförd av en grupp motståndsmän.
Observationen ägde enligt uppgift rum klockan 22:30 onsdagen den 12
april.
Gruppen skall enligt uppgift ha siktats vid ett vägskäl i utkanten av
Kama Ado, 48 kilometer ifrån Jalalabad. Gruppen skall ha bestått
av sju stycken män av afghansk härkomst klädda i civila kläder. Fyra
av männen skall ha burit automatvapen synligt. Männen skall enligt
uppgift ha lastat på trälådor på två lastbilar.
B.5 Blair
The opening of a speech given by British Prime Minister Tony Blair after the
bombing of the London Underground.
The text is comprised of 929 characters.
The Text:
The greatest danger is that we fail to face up to the nature of the threat
we are dealing with. What we witnessed in London last Thursday week
was not an aberrant act.
It was not random. It was not a product of particular local circumstances in West Yorkshire.
Senseless though any such horrible murder is, it was not without sense
for its organisers. It had a purpose. It was done according to a plan.
It was meant.
What we are confronting here is an evil ideology.
It is not a clash of civilisations - all civilised people, Muslim or other,
feel revulsion at it. But it is a global struggle and it is a battle of ideas,
hearts and minds, both within Islam and outside it.
This is the battle that must be won, a battle not just about the terrorist methods but their views. Not just their barbaric acts, but their
barbaric ideas. Not only what they do but what they think and the
thinking they would impose on others.
Appendix C
Source Coding Graphs
C.1 Dictionary Techniques
Four graphs are presented here, comparing the different dictionary techniques
using fixed length as well as variable length codes, on the messages in Swedish
(closer to the language model) as well as on all messages (including those in
languages other than that of the model).
Figure C.1. Comparison: Fixed length codes on Swedish messages
Figure C.2. Comparison: Variable length codes on Swedish messages
Figure C.3. Comparison: Fixed length codes on all messages
Figure C.4. Comparison: Variable length codes on all messages
Appendix D
Steganographic Texts
D.1 DN-1
Language reference file DN-4 was used when encoding this text. Contexts of
maximum 16 characters and minimum 1 character in length were used. No occurrence
threshold for contexts was used. The escape probabilities are scaled by 15. The
steganographic expansion factor, ef, is 7.87, but taking initial source coding
into account the actual expansion factor is 2.95.
Plaintext
Militärjuntan har utsett en premiärminister
Thailändska tidningar uppger att Supachai Panitchpakdi, före detta
chef för WTO och handelsminister, har accepterat ett erbjudande av
juntan att bli ny premiärminister för interimsregeringen, rapporterade
Reuters på tisdagen. Supachai, som nu arbetar för FN i Génève, lät sig
övertalas att acceptera posten av den kungliga chefsrådgivaren Prem
Tinsulanonda, skriver den engelskspråkiga tidningen Nation, som citerar "högt placerade källor". Thaispråkiga tidningar rapporterade liknande versioner, enligt Reuters.
Stegotext
Dessutom att polisen efter en normal bankomat. Meri valdeskriva
det, använder hemmavidaren steget mot etis beteendtals vål, menar
Vladislav Kazmierska TVärldshälften gemensade att säkerhetsstämpe
glömmit i riksdans medålig kontroll övensson möjliggöra hemligstämplad, säger Fridolyckohjälp av tre gånger omöjlighetet med rovdjurs
verk tog maktfullkomligt: Du för hur Irwin simmar bredvikt,
Melområn överlämnade ocksbussen i följöverket handlar inte bara områn politar dockan vidare. Håkan Waxegången. Post serverad militärbasen.
Wästberedda nämnar med, och ger styrelseordförande i fonden, att
Cutty Sarkotikabrott. Hit högrev från Bromma flygplats föras till
Egyptierna. Interna i Eurskifte och gamla damerika håltsfredligare
ett försök från Uneschenskapliga influters fråga. Englandetsammurdes av dykare tydde påtalar riktivt blank oron för Europaråklagare
Tornvisar många länder till den immuniversitet senaste,7 pension. VÄNERSBOReuters från djuptiets positionen hypel även hur dåligt
förhålla element i de ammunitet kan inte i det listerna, och uppmanadsbygd unde inte följande demonshytten.
Nu är domen motisk, säger en kvicksilvret (elserala idémagav upp är
ju uppenbarliga influensavirus. Därför finns det med mycker att en
mängd fondomar, säkransar på tråden. Maskning för NRF. Tvaret
om ettan operativ chefer planen inget nämna de första icke-kommer
att spela sinspektört polisens platset inför passade Peter Manmohan
Singhet prövad. Eppolito om Ivertsätter Thundre barn eller intrerat
på ön. Medantaga-virus från Skratta åt den i 78 döda marginal till
övrihet. Hit över. De som.
Beslutsfattarna somsprev egyptier ska utmå sinlan
D.2 CaP-1
Language reference file CaP was used when encoding this text. Contexts of
maximum 12 characters and minimum 1 character in length were used. The escape
probabilities are scaled by 15. The steganographic expansion factor, ef, is
5.57, but taking initial source coding into account the total expansion factor
is 2.68.
Plaintext
Thailand appoints interim PM
Former World Trade Organisation head Supachai Panitchpakdi has
agreed to be Thailand’s new interim prime minister, newspapers said
on Tuesday, as the country’s military rulers unveiled a plan to return
gradually to the barracks.
However, promises from army chief Sonthi Boonyaratglin, who overthrew prime minister Thaksin Shinawatra in a bloodless coup a week
ago, to restore democracy within a year sounded like a re-run of a military putsch in 1991, analysts said.
Stegotext
Your brother and sister were makes and gone? You have are first.
Raskolnikov in here was the best. Vahrushin, oh ye come toyour else.
Is it precial again.... The second thouse, it’s true; he flung him comrades goabsurd, and subscript had believe in theirs I notcome forth,
you’d better readyto bring thread her tall write angry and bowed, he
added answered sharplyevskyI amp. Freedom for he is afool, haven’ !
cried the streamily–it was evident that therefore so. She hased doubtle
mother? Our Lorder,lost my lessons an indescrutinised him. Somefift
her in the rain! I mere capable, than onclusion.Yusupovg it in the circumstance abounds sum invite position, you wild haughtiness. That’s
right, is the firstment of instant (thoughshe had been brough ill ill.
You could she gone into the street looking careless.He has to stoverstepposing so full do. He may, pedoubled for thedoor, now are well
fact, if the discovered looking at that. Here. In this very minute and
looked me, hardly wishionfoundictive angle had a sudden sittinct visitedRaskolnikov steps from thefiftyhussing the stairsburg? he thought
to himself. Huld you give me with a rush at motioned her took no need
for certain a week of discussion and, as it were at, if only from pursuing them allce uponRagainrevezed the His tongue out of the room full
possession of a s
Appendix E
Source Coding Evaluation Environment
E.1 Screen Shots
Figure E.1. Creating source coders in SCEE
Figure E.2. Visualization of bit allocation in SCEE
Figure E.3. Steganography and entropy estimation in SCEE
Appendix F
Acronyms
3GPP 3rd Generation Partnership Project. A standardization group aimed at
third generation (3G) telephony standards.
AES Advanced Encryption Standard. A common symmetric cipher meant to
replace DES.
BWT Burrows-Wheeler Transform. A transform simplifying source coding of a
set of data.
CBC Cipher Block Chaining. A block cipher operation mode.
CFB Cipher FeedBack. A block cipher operation mode.
CSD Circuit Switched Data. A GSM data channel service.
CTR CounTeR. A block cipher operation mode.
DES Data Encryption Standard. A common although slightly outdated
symmetric encryption algorithm.
DH Diffie-Hellman. A commonly used key-exchange algorithm.
DMC Dynamic Markov Coding. A source coding method.
DoS Denial-of-Service. An attack on a system aimed at denying its users
service.
ECB Electronic CodeBook. A block cipher operation mode.
ECC Elliptic Curve Cryptography. An asymmetric cipher.
ECDH Elliptic Curve Diffie-Hellman. A version of the Diffie-Hellman key
agreement protocol using Elliptic Curve Cryptography.
EMSG Encrypted short MeSsaGe. A Sectra-specific message protocol for
transmission of encrypted short messages.
GPRS General Packet Radio Service. A data transmission service available in
some 2G telephony networks (often referred to as 2.5G).
GSM Global System for Mobile communication. A second generation (2G)
wireless telephony standard.
HMAC keyed-Hash Message Authentication Code. A cryptographic hash
method using a shared secret and a cryptographic hash algorithm.
IV Initialisation Vector. A random bit-vector used in conjunction with a
cryptographic key.
KDC Key Distribution Center. A trusted central authority handling session
keys.
LFF Longest Fragment First. A parsing method.
LFSR Linear Feedback Shift Register. A register with a feedback mechanism
that generates a series of ones and zeros. Commonly used to generate
pseudo-random sequences.
LOE Local Order Estimation. A method to derive the initial prediction context
in PPM.
MAC Message Authentication Code. A cryptographic hash to protect a message
from tampering.
MITM Man-In-The-Middle. A form of cryptographic attack.
MD5 Message-Digest algorithm 5. A cryptographic hash algorithm.
MMS Multimedia Messaging Service. A standard enabling the sending of
multimedia objects over telephony networks.
NIST National Institute of Standards and Technology. A US standardization
agency.
Nonce Number used ONCE. An IV guaranteed to be used only once.
OFB Output FeedBack. A block cipher operation mode.
PPM Prediction by Partial Match. A data source coding method.
RC4 Rivest Cipher 4. A commonly used although not very secure stream cipher.
RFC Request For Comments. A formalized memorandum addressing Internet
standards.
RSA Rivest-Shamir-Adleman. A common asymmetric cipher.
SDS Short Data Service. A text messaging service available in TETRA
networks.
SEE Secondary Escape Estimation. An adaptive scheme to derive escape
probabilities.
SHA-1 Secure Hash Algorithm 1. A cryptographic hash algorithm.
SMS Short Message Service. A message service available in GSM networks.
TETRA TErrestrial Trunked RAdio. A personal mobile radio standard used by
police, ambulance, fire departments and military.
UDH User Data Header. A header available in SMS and SDS messages used to
signal message concatenation and multimedia message content.
UMTS Universal Mobile Telecommunications System. A third generation (3G)
wireless telephony standard.
WAP Wireless Application Protocol. A protocol aimed at enabling Internet
access from mobile units such as cellular phones.
Upphovsrätt
Detta dokument hålls tillgängligt på Internet — eller dess framtida ersättare —
under 25 år från publiceringsdatum under förutsättning att inga extraordinära
omständigheter uppstår.
Tillgång till dokumentet innebär tillstånd för var och en att läsa, ladda ner,
skriva ut enstaka kopior för enskilt bruk och att använda det oförändrat för
icke-kommersiell forskning och för undervisning. Överföring av upphovsrätten vid
en senare tidpunkt kan inte upphäva detta tillstånd. All annan användning av
dokumentet kräver upphovsmannens medgivande. För att garantera äktheten,
säkerheten och tillgängligheten finns det lösningar av teknisk och administrativ
art.
Upphovsmannens ideella rätt innefattar rätt att bli nämnd som upphovsman
i den omfattning som god sed kräver vid användning av dokumentet på ovan beskrivna sätt samt skydd mot att dokumentet ändras eller presenteras i sådan form
eller i sådant sammanhang som är kränkande för upphovsmannens litterära eller
konstnärliga anseende eller egenart.
För ytterligare information om Linköping University Electronic Press se förlagets hemsida http://www.ep.liu.se/
Copyright
The publishers will keep this document online on the Internet — or its possible replacement — for a period of 25 years from the date of publication barring
exceptional circumstances.
The online availability of the document implies a permanent permission for
anyone to read, to download, to print out single copies for your own use and
to use it unchanged for any non-commercial research and educational purpose.
Subsequent transfers of copyright cannot revoke this permission. All other uses of
the document are conditional on the consent of the copyright owner. The publisher
has taken technical and administrative measures to assure authenticity, security
and accessibility.
According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected
against infringement.
For additional information about the Linköping University Electronic Press
and its procedures for publication and for assurance of document integrity, please
refer to its www home page: http://www.ep.liu.se/
© David Hertz