## Monday, February 27, 2017

### a bit of algo: LZW Compression

LZW compression is a magic show. I am left awe-struck as the algorithm compresses and uncompresses bits of data. The magic is not in the fact that it does compress the data, no, that's the boring part. The magic is in the simplicity and elegance of it.

Compression
Given a string "bananababa", the algorithm starts with a code table. We initialize the code table with the following entries:

0 - a
1 - b
2 - n

This is called the initial code table.

LZW then iterates through the letters in the string while also updating the code table. The algorithm is simply:

buffer := empty;
for c in str:
if buffer + c is in code_table:
buffer := buffer + c;
else:
output code_table[buffer];
add (next_index_number++, buffer + c) into code_table;
buffer := c;
output code_table[buffer];

That's it! So with our example "bananababa", we get:

1. b
buffer := b;
2. a  [ba]
buffer := a;
output := 1;
3. n  [ban]
buffer := n;
output := 1, 0;
4. a  [bana]
buffer := a;
output := 1, 0, 2;
5. n  [banan]
buffer := an;
6. a  [banana]
buffer := a;
output := 1, 0, 2, 4;
7. b  [bananab]
buffer := b;
output := 1, 0, 2, 4, 0;
8. a  [bananaba]
buffer := ba;
9. b  [bananabab]
buffer := b;
output := 1, 0, 2, 4, 0, 3;
10. a [bananababa]
buffer := ba;
11. EOF
flush buffer
output := 1, 0, 2, 4, 0, 3, 3;

Hence the end result of the compression from [1,0,2,0,2,0,1,0,1,0] (bananababa) is [1,0,2,4,0,3,3].
As the size of code_table can get really big, usually a simple scheme such as limiting the code_table size to some threshold is used.

Decompression
You might think that the decompression is simple because we just need to perform code lookup on the entire code table. However the decompression actually needs only the compressed result and the same initial code table! The procedure will rebuilt exactly the same code_table in the end after reading the decompressed data.

The decompression algorithm is magical:

prev_code := compressed_data;
for code in compressed_data[1, end]:
output code_table[prev_code];
if code in code_table:
next_entry := code_table[prev_code] + first letter of code_table[code];
else:
// Special case, see below.
next_entry := code_table[prev_code] + first letter of code_table[prev_code];
prev_code := code;
output code_table[prev_code];

The first case of the if statement can be intuitively understood as this: In the compression algorithm, we are outputting code only when buffer + c is not found in code_table. Hence in the decompression procedure, we are simply reverting the compression logic: code_table[prev_code] is exactly the "buffer" and the first letter of code_table[code] is c.

Example decompression using [1,0,2,4,0,3,3] and the same initial code table:

0 - a
1 - b
2 - n

1. Initialize prev_code = 1.
2. 0  [1,0]
output := b;
prev_code := 0;
3. 2  [1,0,2]
output := ba;
prev_code := 2;
4. 4 [1,0,2,4]
output := ban;
prev_code := 4;
5. 0 [1,0,2,4,0]
output := banan;
prev_code := 0;
6. 3 [1,0,2,4,0,3]
output := banana;
prev_code := 3;
7. 3 [1,0,2,4,0,3]
output := bananaba;
prev_code := 3;
8. EOF: flush prev_code;
output := bananababa;

The else branch of the if condition is a special case in which 'code' is not found in our code_table yet. Consider string "abababab", starting with code table (0, a), (1, b):

The compression produces:
1. a --> buffer := a
2. b --> buffer := b   add (2, ab)    output 0
3. a --> buffer := a   add (3, ba)    output 1
4. b --> buffer := ab
5. a --> buffer := a   add (4, aba)  output 2
6. b --> buffer := ab
7. a --> buffer := aba
8. b --> buffer := b   add (5, abab) output 4
9. EOF -->  flush buffer,                output 1
with compression result [0, 1, 2, 4, 1].

The decompression procedure without the "else" branch is hence:
1. Initialize prev_code := 0;
2. 1  [0, 1]
output := a;
prev_code := 1;
3. 2  [0, 1, 2]
output := ab;
prev_code := 2;
4. 4 [0, 1, 2, 4]
output := abab;
add (4, code_table + first_letter(code_table)) <<<<< ERROR!
-----

Hence we need to handle this special case. In short, the key observation is that the first letter of code_table['prev_code'] must be the same as that of 'code' for this to happen.

This scenario happens when 'code' is exactly the next index number entry to the code_table*. Meaning, it must be the newest entry of the compression procedure before 'code' is output. Let X be that newest entry, and Y be the buffer value just before encountering c such that buffer+c is not in the code_table. Then c must be X, such that Y + X is X. This means that X starts and ends with the same character. Finally, it means that the first character of Y is the same as that of X. Therefore we can use the first letter of prev_code as the proxy to the real value of first_letter(code_table['code']).

[*Additional note: How do we know that is true? Suppose the the compression algorithm outputs code C as the n-th code output, then its code_table must has n entries (ignoring initial entries). Hence C <= n. However when the decompression reads in C, it has only n-1 code_table entries (ignoring initial entries). Hence for C to be bigger than the code table entry, it must be > n-1 but it cannot be larger than n. Hence C = n.]

That's all! LZW is quite easy to implement and it's a good algorithm to know since it might come in handy one day.