(I need name suggestions for this project!)
Note
This project is the second version of my original compression project here.
The biggest differences between the old and current versions are:

- The program now compresses files based on their byte content (`Vec<u8>`), rather than strings (`&str`)
- The repository is better structured
- The current version now has Huffman Encoding, which plays a huge role in compression rate
How to use
The project can be built with `cargo`. Optionally, the project can be compiled for fuzzing via `cargo afl build`, though this requires Cargo support for AFL++.

Here's a list of supported CLI flags, which can be seen with `./compression-v2 --help`:
```text
Compresses an input file into an output (with extension .pkz)

Usage: compression-v2 [OPTIONS] <INPUT_PATH>

Arguments:
  <INPUT_PATH>
          File to compress

Options:
  -d, --decompress
          Decompress instead of Compress. Expects a .pkz file as input

  -s, --stdout
          Redirect output to stdout. Does not create a .pkz file

  -q, --quiet
          Hide debug output

  -p, --pipeline <PIPELINE>...
          Provide a custom Encoding Pipeline in a space-separated list. Ignored if --decompress is used.
          Possible options (also the default): Bwt Mtf Rle Huff

  -c, --check-integrity
          Performs the compression and verifies that it decodes to the original content. Ignored if --decompress is used

  -h, --help
          Print help (see a summary with '-h')
```
As mentioned above, standard compression outputs a `.pkz` file, while decompression requires a `.pkz` file as input, so as to avoid accidental decoding.
How it works
This section will cover my thought process while implementing each of these algorithms, along with some notes/discoveries I made along the way.
Encoding Pipeline
This project, like its predecessor, supports a variable encoding pipeline. This means that each stage of (de)compression strictly takes in a `Vec<u8>` and outputs a `Vec<u8>`, allowing encoders to be chained together.

Custom encoding pipelines can be specified with the `-p, --pipeline` flag, followed by a list of encodings. Encodings can be repeated and used in any order, but must be one of the following:
- BWT (Burrows-Wheeler Transform)
- MTF (Move-To-Front)
- RLE (Run-Length-Encoding)
- HUFF (Huffman Coding)
(In the future, I plan to add BBWT, a blocked version of the BWT, which would allow for parallel processing.)
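The chaining idea can be sketched as follows. The `Stage` alias, `run_pipeline`, and the toy stages below are my own illustration of the concept, not the project's actual API:

```rust
// Each stage maps Vec<u8> -> Vec<u8>, so any sequence of stages can be
// folded over the input. `Stage` and the stage names are hypothetical.
type Stage = fn(Vec<u8>) -> Vec<u8>;

fn run_pipeline(stages: &[Stage], input: Vec<u8>) -> Vec<u8> {
    stages.iter().fold(input, |data, stage| stage(data))
}

// Two toy stages, standing in for real encoders like BWT or MTF:
fn reverse(mut data: Vec<u8>) -> Vec<u8> {
    data.reverse();
    data
}

fn double(data: Vec<u8>) -> Vec<u8> {
    data.iter().flat_map(|&b| [b, b]).collect()
}

fn main() {
    // "abc" -> reverse -> "cba" -> double -> "ccbbaa"
    let out = run_pipeline(&[reverse, double], b"abc".to_vec());
    assert_eq!(out, b"ccbbaa".to_vec());
    println!("{:?}", out);
}
```

Since every stage shares the same signature, decompression can simply run the inverse stages in reverse order.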
BWT
The Burrows-Wheeler-Transform is a process which permutes a sequence of bytes in such a way that produces long runs of repeated bytes. As you can imagine, this sets up the scene for encoders like MTF and RLE, which benefit from such patterns.
The true BWT relies on an end-of-sequence character, typically denoted by `$`, which is defined to be lexicographically "less than" all other characters. The first challenge while implementing this algorithm was figuring out a way to denote the existence of `$` in the encoded output.

The only character that can meet this "infimum" property is `0x0`, though in many programs, `0x0` is the most common byte. In fact, regardless of which character I choose here, I will always have to face the problem of "escaping".
To imagine this, consider the input sequence: `banana`. To prep this sequence for the BWT, let's append an arbitrary end-of-sequence character, which for the sake of this example will happen to be `a`. Thus, what gets inputted to the core BWT algorithm is the string `bananaa`.

After permuting the string, we would theoretically get `annb$aa`, but due to our explicitly chosen delimiter, we instead get `annbaaa`. But how can we even begin to decode this?
As mentioned before, we can remedy the situation by "escaping" the character we choose as our delimiter. Hence, to prep `banana`, we need to perform 2 replacements:

- Replace all occurrences of `\` with `\\`
- Replace all occurrences of `a` with `\a`

The first replacement is required to allow the second one to work, as without it, the input `banan\a` would contain the sequence `\\a` post-prep. And by definition of "escape", the first `\` escapes the second `\`, which is NOT what we want.
Turns out, all of this horrible delimiter nonsense can be avoided by using a smarter representation. Since we know there is only one delimiter in the BWT output, we can get away with only knowing where it is.

To do this, the in-memory program will view the input as a list of Tokens, where a Token Enum is defined as:

```rust
enum BwtToken {
    Delim,
    Byte(u8),
}
```

After it finishes processing, the program will convert the `Vec<BwtToken>` to a `Vec<u8>`, simply skipping the `BwtToken::Delim` case. The 0-indexed position of the delimiter is then prepended to the output in the form `position|output`, where `position` is written as a base-36 integer.

So, our previous example of `banana` would be encoded as `4|annbaa`.
MTF
The Move-To-Front transform is another encoding process, which aims to take advantage of "recently used symbols". The details of this algorithm can be found at the Wikipedia page (linked).

The above link discusses a common approach of using a "pre-defined alphabet", which is manually constructed from textual tendencies within most data. My approach uses the implementation described in Brandon Simmons' blog post here.

There wasn't anything particularly interesting to note from this section; the implementation was pretty one-to-one from Simmons' website, with a few small adjustments due to Rust's syntax.
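For reference, the core of a byte-oriented MTF encoder is small enough to show whole. This is a generic sketch over the full `0..=255` alphabet, not the project's exact code:

```rust
// Move-To-Front: emit each byte's current index in the alphabet, then
// move that byte to the front. Runs of a repeated byte become runs of
// zeros, which is exactly what a following RLE stage wants to see.
fn mtf_encode(input: &[u8]) -> Vec<u8> {
    let mut alphabet: Vec<u8> = (0..=255).collect();
    input
        .iter()
        .map(|&b| {
            let idx = alphabet.iter().position(|&x| x == b).unwrap();
            let byte = alphabet.remove(idx);
            alphabet.insert(0, byte);
            idx as u8
        })
        .collect()
}

fn main() {
    // 'a' (0x61 = 97) first costs its full index; repeats then cost 0 each.
    assert_eq!(mtf_encode(b"aaab"), vec![97, 0, 0, 98]);
    println!("{:?}", mtf_encode(b"aaab"));
}
```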
RLE
Run-Length-Encoding is a very straightforward encoding scheme that I'm sure almost everyone is familiar with. The gist of the algorithm is:

Grunts. "Caveman see repeated sequence, Caveman replace with number." Grunts.

In theory, it's very simple, but the implementation has some annoying inconveniences. As the BWT foreshadowed, we have to "escape" characters. It's pretty easy to see why, with the following input → encoded → decoded cases:

- `aaaaabbbb` → `5a4b` → `aaaaabbbb` (correct)
- `33333bbbb` → `534b` → `bbbbbbbb....b` (incorrect)
The obvious fix is to add a delimiter character before each `count` number, resulting in:

- `aaaaabbbb` → `\5a\4b` → `aaaaabbbb` (correct)
- `33333bbbb` → `\53\4b` → ?

Wait, how should the decoder interpret the second case? Is it `3` x 5 followed by `b` x 4? Or is it `\` x 53 followed by `b`?

Additionally, small sequences of characters face the bloating problem (`aa` → `\2a`), where the replacement takes more space than the original.
In order to solve ambiguous cases like this and other special cases, I created my RLE with the following rule set:

- Replace all occurrences of `\` with `\\`
- Replace all occurrences of `a` with `\a`
- Only perform an RLE replacement if `4 <= count <= 255`

Why constrain the replacement to `4 <= count <= 255`? Note that `aaa` → `\3a` (same size), but `aaaa` → `\4a` (1 byte less). As for the upper bound of `255` (or `u8::MAX`), this ensures that `count` can be represented with a single byte. Without this, we would have to deal with double delimiters (i.e. `[count]byte`), which just sounds like a pain in the ass, and that's enough for me to avoid it. Besides, a running sequence of over 255 repeated bytes is very unlikely, meaning we're really not losing out on much with this approach.
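The encode side of this rule set might look like the following sketch. I'm assuming here that the count is stored as a single raw byte after the `\` delimiter (the README's `\5a` examples use an ASCII digit only for readability); this is my illustration, not the project's exact code:

```rust
// RLE encode: runs of 4..=255 identical bytes become [b'\\', count, byte]
// with the count as one raw byte; shorter runs are emitted literally,
// with any literal '\' escaped as "\\".
// Note: a complete codec would also need to disambiguate a count byte
// that happens to equal b'\\' (0x5C); that case is omitted for brevity.
fn rle_encode(input: &[u8]) -> Vec<u8> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < input.len() {
        let byte = input[i];
        // Measure the run, capped at 255 so the count fits in one byte.
        let mut count = 1usize;
        while i + count < input.len() && input[i + count] == byte && count < 255 {
            count += 1;
        }
        if count >= 4 {
            out.push(b'\\');
            out.push(count as u8);
            out.push(byte);
        } else {
            // Short runs aren't worth replacing; emit literally (escaped).
            for _ in 0..count {
                if byte == b'\\' {
                    out.push(b'\\');
                }
                out.push(byte);
            }
        }
        i += count;
    }
    out
}

fn main() {
    assert_eq!(rle_encode(b"aaaaabbbb"), vec![b'\\', 5, b'a', b'\\', 4, b'b']);
    assert_eq!(rle_encode(b"aaa"), b"aaa".to_vec());
}
```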
Fun fact: Most of my debugging painfully led me back to this encoder. It has proven time and time again to be a pain in the ass, so perhaps there is a better way to implement it.
Huffman Coding
Last, but certainly not least, we have Huffman Coding, one of the most popular compression techniques used today. Very much like MTF, the implementation details of this encoding scheme aren’t too special.
The only place where I "freestyled" was with the (de)serialization of the Huffman tree. The size (in nodes) of the Huffman tree is upper bounded by `2 * 256 - 1 = 511`, as can be seen in the picture:

(Imagine the bottom-right node starts at `0x00` and goes up to `0xFF`, resulting in 256 leaf nodes and 255 internal nodes.)
Since the max tree size is 511 nodes, I decided to (de)serialize the tree by prepending the Pre-Order & In-Order traversals to the Huffman encoded output.
The rest of the implementation is very standard, requiring lots of bit-wise operations and some padding.
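For context, that standard tree-building step looks roughly like this. It's a generic heap-based sketch with my own names, shown computing code lengths rather than the full bitstream:

```rust
use std::cmp::Reverse;
use std::collections::{BinaryHeap, HashMap};

// A Huffman tree node: every internal node has exactly two children,
// which is why 256 distinct leaf bytes imply at most 255 internal
// nodes, i.e. the 511-node bound mentioned above.
enum Node {
    Leaf(u8),
    Internal(Box<Node>, Box<Node>),
}

// Standard heap-based construction: repeatedly merge the two least
// frequent subtrees. Ids break frequency ties deterministically.
fn build_tree(freqs: &[(u8, u64)]) -> Option<Node> {
    let mut heap = BinaryHeap::new();
    let mut nodes: HashMap<u64, Node> = HashMap::new();
    let mut next_id: u64 = 0;
    for &(byte, freq) in freqs {
        nodes.insert(next_id, Node::Leaf(byte));
        heap.push(Reverse((freq, next_id)));
        next_id += 1;
    }
    while heap.len() > 1 {
        let Reverse((f1, i1)) = heap.pop().unwrap();
        let Reverse((f2, i2)) = heap.pop().unwrap();
        let merged = Node::Internal(
            Box::new(nodes.remove(&i1).unwrap()),
            Box::new(nodes.remove(&i2).unwrap()),
        );
        nodes.insert(next_id, merged);
        heap.push(Reverse((f1 + f2, next_id)));
        next_id += 1;
    }
    heap.pop().map(|Reverse((_, id))| nodes.remove(&id).unwrap())
}

// Walk the tree to find each byte's codeword length (its depth).
fn code_lengths(node: &Node, depth: usize, out: &mut HashMap<u8, usize>) {
    match node {
        Node::Leaf(b) => {
            // max(1) covers the degenerate single-symbol tree.
            out.insert(*b, depth.max(1));
        }
        Node::Internal(left, right) => {
            code_lengths(left, depth + 1, out);
            code_lengths(right, depth + 1, out);
        }
    }
}

fn main() {
    // Byte frequencies for "abracadabra": a:5 b:2 r:2 c:1 d:1.
    let freqs = [(b'a', 5), (b'b', 2), (b'r', 2), (b'c', 1), (b'd', 1)];
    let tree = build_tree(&freqs).unwrap();
    let mut lengths = HashMap::new();
    code_lengths(&tree, 0, &mut lengths);
    assert_eq!(lengths[&b'a'], 1); // most frequent byte gets the shortest code
    println!("{:?}", lengths);
}
```

Emitting the actual codewords is then a matter of assigning 0/1 to the left/right branches during the same walk, which is where the bit-wise operations and padding come in.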