Homework #3 Programming Assignment Solution¶

In this homework, we will reuse the functions from the last two homeworks. (Tunstall coding, binary Huffman coding)

using PyPlot
using NBInclude
nbinclude("Basic Functions.ipynb");
nbinclude("Binary Huffman Coding.ipynb");
nbinclude("Tunstall Coding.ipynb");

WARNING: Method definition entropy(Any, Any) in module Main at /home/juser/ECE 154C/Basic Functions.ipynb:In[1]:2 overwritten at /home/juser/ECE 154C/Basic Functions.ipynb:In[1]:2.
WARNING: Method definition entropy(Any) in module Main at /home/juser/ECE 154C/Basic Functions.ipynb:In[1]:5 overwritten at /home/juser/ECE 154C/Basic Functions.ipynb:In[1]:5.
WARNING: Method definition avgLength(Any, Any) in module Main at /home/juser/ECE 154C/Basic Functions.ipynb:In[2]:1 overwritten at /home/juser/ECE 154C/Basic Functions.ipynb:In[2]:1.
WARNING: Method definition blockPmf(Any, Any, Any) in module Main at /home/juser/ECE 154C/Basic Functions.ipynb:In[3]:5 overwritten at /home/juser/ECE 154C/Basic Functions.ipynb:In[3]:5.
WARNING: Method definition blockPmf(Any, Any) in module Main at /home/juser/ECE 154C/Basic Functions.ipynb:In[3]:16 overwritten at /home/juser/ECE 154C/Basic Functions.ipynb:In[3]:16.
WARNING: Method definition minValues(Any, Any) in module Main at /home/juser/ECE 154C/Basic Functions.ipynb:In[4]:4 overwritten at /home/juser/ECE 154C/Basic Functions.ipynb:In[4]:4.
WARNING: Method definition testScheme(Any...) in module Main at /home/juser/ECE 154C/Basic Functions.ipynb:In[5]:2 overwritten at /home/juser/ECE 154C/Basic Functions.ipynb:In[5]:2.
WARNING: Method definition testMultiSchemes(Any, Any, Any) in module Main at /home/juser/ECE 154C/Basic Functions.ipynb:In[6]:2 overwritten at /home/juser/ECE 154C/Basic Functions.ipynb:In[6]:2.

Problem 1. Tunstall Coding¶

In this problem, we want to combine Tunstall coding with (binary) Huffman coding to find a better variable-to-variable encoding scheme. For $\texttt{src}\_\texttt{pmf} = \{0.9, 0.1\},$ using the functions $\texttt{Tunstall(src}\_\texttt{pmf, num}\_\texttt{phrases)}$ and $\texttt{binaryHuffman(pmf)},$ find the average length of codewords per average length of symbols for $\texttt{num}\_\texttt{phrases} = 1$ to $1000,$ and plot it. Does it converge? If so, what is the limit?¶

src_pmf = [0.9, 0.1]
avg_lens = zeros(1000)
for k = 2:1000
    phrases, pmf = Tunstall(src_pmf, k)
    code = binaryHuffman(pmf)
    avg_len_codewords = avgLength(code,pmf)
    avg_len_phrases = avgLength(phrases, pmf)
    avg_lens[k] = avg_len_codewords/avg_len_phrases
end

plot(3:1000, avg_lens[3:1000], marker = "None", label = "Compression ratio")
axhline(entropy(src_pmf), color = "g", linestyle = "--", label = "Entropy")
xlabel("num_phrases")
ylabel("compression ratio")
legend(loc = "upper right", fancybox = "true")

PyObject <matplotlib.legend.Legend object at 0x7fdf1852a190>

As one can check, the compression ratio seems to converge the entropy, though it seems to fluctuate.

avg_lens[end] - entropy(src_pmf)

0.0023470495135243796

Problem 2. (Difficult) Tunstall+Huffman vs. Optimal branching¶

Note that Tunstall coding scheme may not give the optimal branching of probabilities. Hence in this problem, we want to find the optimal branching for the given number of phrases. Write a code that compares the Tunstall+Huffman with the (optimal branching)+Huffman in terms of the average length of codewords per average length of symbols, for a given number of phrases.¶

(a) For $\texttt{src}\_\texttt{pmf} = \{ 0.9, 0.1\},$ find the smallest number of phrases such that the optimal branching is strictly better than Tunstall coding.¶

(b) Repeat the experiment for $\texttt{src_pmf} = \{ 0.7, 0.3\}.$¶

Find the optimal tree for the given number of phrases.¶

Suppose a Bernoulli source, say $\mathcal{X} = \{0,1\}$. How can we evaluate all possible trees? Here we assume a binary tree. Let $d$ be the number of phrases. we can enumerate all possibilites by induction.

The inputs are pmf = $[p, 1-p]$ and num_phrases = (the number of phrases).

The outputs are all possible pmf of size num_phrases, and corresponding phrases.

Example: How can we enumerate all possible of trees (or pmfs)?¶

Let pmf1 = [9, 1].

k=1, [[9,1]]

k=2, we want [[81,9,1], [9,9,1]], 1 submatrix of size (2, 3).

k=3, we want [[729, 81, 9, 1], [81, 81, 9, 1], [81, 9, 9, 1]] + [[81, 9, 9, 1], [9, 81, 9, 1], [9, 9, 9, 1]], 2 submatrices of size (3, 4).

k=4, 6 submatrices of size (4, 5).

k=5, 24 submatrices of size (5, 6).

For general k, (k-1)! submatrices of size (k, k+1) is needed.

Suppose we have pmf_list from the previous iteration, which is a list of pmfs. What we want to do is to pick each pmf in the list, enumerate all extension, and add it. Note that there exists a duplicate, but for our purpose this construction is enough.

Code¶

# pmf_list is the list of all possible distributions of a given number of phrases.
# phrase_list is the list of all possible distributions of a given number of phrases.
# We will keep updating pmf_list and phrase_list through the iterations.
# We save newly extended pmfs and phrases in pmf_list_new, phrases_list_new.

for p1 in [0.9, 0.7]
    pmf1 = [p1, 1-p1]
    symbol1 = ["A", "B"]
    phrases_list = [symbol1]
    pmf_list = [pmf1]
    max_num_phrases = 10
    for k = 3:max_num_phrases
        pmf_list_new = Array{Float64,1}[]
        phrases_list_new = Array{String,1}[]
        for j = 1:length(pmf_list)  # over all possible distributions
            for i = 1:length(pmf_list[j])  # at each symbol s, extend it to s"A" and s"B"
                pmf_new = copy(pmf_list[j])
                phrases_new = copy(phrases_list[j])
                splice!(pmf_new, i, pmf_new[i]*pmf1)
                splice!(phrases_new, i, map(x -> "$(phrases_new[i])$x", symbol1))
                push!(pmf_list_new, pmf_new)
                push!(phrases_list_new, phrases_new)
                # to check whether it is generating a correct pmf
    #             print("kji: ", k, j, i, " ")
    #             print("newly generated ", pmf_new, " ")
    #             print("newly generated phrases ", phrases_new, "\n")
            end
        end
        pmf_list = copy(pmf_list_new)
        phrases_list = copy(phrases_list_new)
        # if you want to check the generated pmfs
    #     print(pmf_list, "\n")  

        # Below is to compare the performance of Tunstall+Huffman with the optimal tree.
        avg_lengths = zeros(length(pmf_list))
        for l in 1:length(pmf_list)
            phrases, pmf = phrases_list[l], pmf_list[l]
            avg_lengths[l] = avgLength(binaryHuffman(pmf), pmf)/avgLength(phrases, pmf)
        end

        phrases, pmf = Tunstall(pmf1, k)
        avg_length_Tunstall = avgLength(binaryHuffman(pmf), pmf)/avgLength(phrases, pmf)
        min_avg_length, min_index =  findmin(avg_lengths)
        print("------------------------------\n")
        print("For the number of phrases = ", k, ", \n")
        print("pmf by Tunstall: ", pmf, "\n")
        print("Tunstall+Huffman average length: ", avg_length_Tunstall, "\n")
        print("The optimal average length: \t ", min_avg_length, "\n")
        print("Phrases by Tunstall ", phrases, "\n")
        print("Phrases by the optimal tree ", phrases_list[min_index], "\n")
        if abs(avg_length_Tunstall-minimum(avg_lengths)) < 1e-7
            print("===> Tunstall+Huffman is optimal!\n")
        else
            print("===> There is a better tree when the number of phrases = ", k, "!\n")
            break
        end
    end
    print("%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%\n")
end

------------------------------
For the number of phrases = 3, 
pmf by Tunstall: [0.1,0.81,0.09]
Tunstall+Huffman average length: 0.6263157894736842
The optimal average length: 	 0.6263157894736842
Phrases by Tunstall String["B","AA","AB"]
Phrases by the optimal tree String["AA","AB","B"]
===> Tunstall+Huffman is optimal!
------------------------------
For the number of phrases = 4, 
pmf by Tunstall: [0.1,0.09,0.729,0.081]
Tunstall+Huffman average length: 0.5321033210332103
The optimal average length: 	 0.5321033210332102
Phrases by Tunstall String["B","AB","AAA","AAB"]
Phrases by the optimal tree String["AAA","AAB","AB","B"]
===> Tunstall+Huffman is optimal!
------------------------------
For the number of phrases = 5, 
pmf by Tunstall: [0.1,0.09,0.081,0.6561,0.0729]
Tunstall+Huffman average length: 0.4907822041291073
The optimal average length: 	 0.4907822041291073
Phrases by Tunstall String["B","AB","AAB","AAAA","AAAB"]
Phrases by the optimal tree String["AAAA","AAAB","AAB","AB","B"]
===> Tunstall+Huffman is optimal!
------------------------------
For the number of phrases = 6, 
pmf by Tunstall: [0.1,0.09,0.081,0.0729,0.59049,0.06561]
Tunstall+Huffman average length: 0.478017630827086
The optimal average length: 	 0.4780176308270859
Phrases by Tunstall String["B","AB","AAB","AAAB","AAAAA","AAAAB"]
Phrases by the optimal tree String["AAAAA","AAAAB","AAAB","AAB","AB","B"]
===> Tunstall+Huffman is optimal!
------------------------------
For the number of phrases = 7, 
pmf by Tunstall: [0.1,0.09,0.081,0.0729,0.06561,0.531441,0.059049]
Tunstall+Huffman average length: 0.47287043894152053
The optimal average length: 	 0.47287043894152053
Phrases by Tunstall String["B","AB","AAB","AAAB","AAAAB","AAAAAA","AAAAAB"]
Phrases by the optimal tree String["AAAAAA","AAAAAB","AAAAB","AAAB","AAB","AB","B"]
===> Tunstall+Huffman is optimal!
------------------------------
For the number of phrases = 8, 
pmf by Tunstall: [0.1,0.09,0.081,0.0729,0.06561,0.059049,0.478297,0.0531441]
Tunstall+Huffman average length: 0.4725119133852185
The optimal average length: 	 0.4725119133852184
Phrases by Tunstall String["B","AB","AAB","AAAB","AAAAB","AAAAAB","AAAAAAA","AAAAAAB"]
Phrases by the optimal tree String["AAAAAAA","AAAAAAB","AAAAAB","AAAAB","AAAB","AAB","AB","B"]
===> Tunstall+Huffman is optimal!
------------------------------
For the number of phrases = 9, 
pmf by Tunstall: [0.1,0.09,0.081,0.0729,0.06561,0.059049,0.0531441,0.430467,0.0478297]
Tunstall+Huffman average length: 0.4755825156265365
The optimal average length: 	 0.4755825156265365
Phrases by Tunstall String["B","AB","AAB","AAAB","AAAAB","AAAAAB","AAAAAAB","AAAAAAAA","AAAAAAAB"]
Phrases by the optimal tree String["AAAAAAAA","AAAAAAAB","AAAAAAB","AAAAAB","AAAAB","AAAB","AAB","AB","B"]
===> Tunstall+Huffman is optimal!
------------------------------
For the number of phrases = 10, 
pmf by Tunstall: [0.1,0.09,0.081,0.0729,0.06561,0.059049,0.0531441,0.0478297,0.38742,0.0430467]
Tunstall+Huffman average length: 0.4780791540381765
The optimal average length: 	 0.4737265617068646
Phrases by Tunstall String["B","AB","AAB","AAAB","AAAAB","AAAAAB","AAAAAAB","AAAAAAAB","AAAAAAAAA","AAAAAAAAB"]
Phrases by the optimal tree String["AAAAAAA","AAAAAAB","AAAAAB","AAAAB","AAAB","AABA","AABB","ABA","ABB","B"]
===> There is a better tree when the number of phrases = 10!
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
------------------------------
For the number of phrases = 3, 
pmf by Tunstall: [0.3,0.49,0.21]
Tunstall+Huffman average length: 0.8882352941176473
The optimal average length: 	 0.8882352941176471
Phrases by Tunstall String["B","AA","AB"]
Phrases by the optimal tree String["AA","AB","B"]
===> Tunstall+Huffman is optimal!
------------------------------
For the number of phrases = 4, 
pmf by Tunstall: [0.3,0.21,0.343,0.147]
Tunstall+Huffman average length: 0.9132420091324203
The optimal average length: 	 0.9005235602094241
Phrases by Tunstall String["B","AB","AAA","AAB"]
Phrases by the optimal tree String["AA","ABA","ABB","B"]
===> There is a better tree when the number of phrases = 4!
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Problem 3. Lempel-Ziv Encoding¶

In this homework, we implement the sliding-window LZ77 encoding scheme. (Lempel and Ziv, 1977)

(a) Find the longest match.¶

Write a program for a function $\texttt{MatchLengthPosition(window, text)}$ that takes a “window” of string (say, English text) $\texttt{window}$ and a string (say, a text sample) $\texttt{text}$ as inputs. If there is a match, the function returns a flag bit 1, the starting position of the longest match in the window (with the position of the rightmost symbol in the window taken as 0), and the match length (according to the LZ77 parsing). If there is a tie for the longest match, then pick the rightmost position. If there is no match, this function returns a flag bit 0, and the first character of $\texttt{text}$. For example,¶

$[1,2,6] = \texttt{MatchLengthPosition}(\texttt{‘‘MY ’’,‘‘MY MY WHAT A HAT IS THAT’’})$ and¶

$[0,\texttt{‘B’}] = \texttt{MatchLengthPosition}(\texttt{‘‘AAAA’’,‘‘BABBAA’’}).$¶

function matchPositionLength(window, text)
    if length(window) == 0
        return (0, text[1])
    end
    merged = "$window$text"
    indices = find(x -> x==text[1], [window...])
    if length(indices) > 0
        matched_len_list = zeros(length(indices))
        for i in 1:length(indices)
            idx = indices[i]
            len_temp = 1
            while len_temp+1 <= length(text)
                if merged[idx+len_temp] == text[len_temp+1]
                    len_temp += 1
                else
#                     print(merged[idx:idx+len_temp-1])  # print the matched sequence; for debugging
                    break
                end
            end
            matched_len_list[i] = len_temp
        end
#         print(matched_len_list)  # for debugging
        matched_len = maximum(matched_len_list)  # find the length of the longest match
        j = maximum(find(x -> x == matched_len, matched_len_list))  # find the largest index of the longest match
        return (1, length(window)-indices[j], Int(matched_len))
    else # if there is no match
        return (0, text[1])
    end
end

matchPositionLength (generic function with 1 method)

Examples¶

matchPositionLength("MY ", "MY MY WHAT A HAT IS THAT")

(1,2,6)

matchPositionLength("AAAA","AAABAA")

(1,0,3)

matchPositionLength("AA", "CAA")

(0,'C')

(b) Encoding the match length.¶

Using the function in part (a), write a program for a function $\texttt{ParseSWLZ(InputText, WindowSize)}$ that takes a string of characters (say, a sample of English text) $\texttt{InputText}$ and a positive integer $\texttt{WindowSize}$ as inputs, and returns the encoding by the sliding window Lempel–Ziv algorithm (using a window of size $\texttt{WindowSize}$). For example, if we input $\texttt{InputText} = \texttt{‘‘MY MY WHAT A HAT IS THAT’’}$ and $\texttt{WindowSize} = 16,$ then the function should output the following.¶

$(0,\texttt{‘M’}),\ (0,\texttt{‘Y’}),\ (0,\texttt{‘ ’}),\ (1,2,3),\ (0,\texttt{‘W’}),\ (0,\texttt{‘H’}),\ (0,\texttt{‘A’}),¶

(0,\texttt{‘T’}), (1,4,1), (1,2,1), (1,1,1), (1,5,4), (0,\texttt{‘I’}), (0,\texttt{‘S’}),\ (1,2,1),\ (1,4,1),\ (1,7,3)$

Simply, we can encode a length $L$ with $2\log L$ bits.

function lengthEncoding1(l)
    enc = ["$c$c" for c in bin(l)]
    enc[end] = (enc[end] == "00" ? "01" : "10");
    return join(enc)
end

lengthEncoding1 (generic function with 1 method)

For example,

lengthEncoding1(5)

"110010"

Similarly, there is another scheme to encode the length of the binary representation. We can simply add that number of 0's and 1 in front of the binary representation! This will also needs $2\log L$ bits.

function lengthEncoding2(l)
    s = bin(l)
    return join([join(['0' for i in 1:length(s)]), '1', s])
end

lengthEncoding2 (generic function with 1 method)

For example,

lengthEncoding2(5)

"0001101"

There is, however, a more clever way using about $\log L + 2\log\log L$ bits.

# Using log(l)+2log(log(l))+O(1) bits
function lengthEncoding3(l)
    s = bin(l)
    return join([lengthEncoding2(length(s)), s])
end

lengthEncoding3 (generic function with 1 method)

For example,

lengthEncoding3(5)

"00111101"

From now on, let's stick with the simplest scheme.

lengthEncoding = lengthEncoding1

lengthEncoding1 (generic function with 1 method)

(c) First, parse a given string according to LZ77 parsing rule.¶

Write a program for a function $\texttt{LengthEncoding(length)}$ that takes any positive integer $\texttt{length}$ as input and outputs the prefix-free binary encoding of $\texttt{length}$ according to the scheme discussed in class. For example,¶

$110010 = \texttt{LengthEncoding}(5)$ and¶

$11110010 = \texttt{LengthEncoding}(13).$¶

function parseStringLZ77(input_text, window_size)
    parsing = Array{Any}(0)  # same as []
    pointer = 1
    while pointer <= length(input_text)
        if pointer > window_size
            window = input_text[pointer-window_size:pointer-1]
        else
            window = input_text[1:pointer-1]
        end
        code = matchPositionLength(window, input_text[pointer:end])
        pointer += (code[1]==1 ? code[3] : 1)  # code[3] is the matched length.
        append!(parsing, [code])
    end
    return parsing
end

parseStringLZ77 (generic function with 1 method)

Examples¶

parseStringLZ77("AABBABAABABB", 6)

8-element Array{Any,1}:
 (0,'A')
 (1,0,1)
 (0,'B')
 (1,0,1)
 (1,2,2)
 (1,5,3)
 (1,1,2)
 (1,0,1)

parseStringLZ77("MY MY WHAT A HAT IS THAT", 16)

17-element Array{Any,1}:
 (0,'M')
 (0,'Y')
 (0,' ')
 (1,2,3)
 (0,'W')
 (0,'H')
 (0,'A')
 (0,'T')
 (1,4,1)
 (1,2,1)
 (1,1,1)
 (1,5,4)
 (0,'I')
 (0,'S')
 (1,2,1)
 (1,4,1)
 (1,7,3)

(d) Finally, output the final encoded bits.¶

Using the functions above, write a program for a function $\texttt{SWLZ(InputText, WindowSize)}$ that takes the same input as before, but returns the binary encoding of the string according to the sliding window Lempel–Ziv algorithm (using a window of size $\texttt{WindowSize}$) and the compression ratio. For the same input as before, $\texttt{InputText} = \texttt{‘‘MY MY WHAT A HAT IS THAT’’}$ and $\texttt{WindowSize} = 16,$ it should output¶

0100110101011001001000001001011100101011101001000010000010101010010100101001010100011010101110001010010010101001110010101010010101111110¶

# A function maps a character to 7-bit ascii
char2ascii(c) = bin(convert(Int, c), 7)

char2ascii (generic function with 1 method)

We can implement the final encoding function using parseStringLZ77.

function encodeLZ77a(input_text, window_size)
    parsing = parseStringLZ77(input_text, window_size)
    binary_code = []
    for code in parsing
        if code[1] == 1
            pos, len = code[2:3]
            temp = join([string(1), bin(pos, Int(ceil(log2(window_size)))), lengthEncoding(len)])
        else
            temp = join([string(0), char2ascii(code[2])])
        end
        append!(binary_code, [temp])
#         print(code, " ", temp, "\n")  # for debugging
    end
    return join(binary_code)
end

encodeLZ77a (generic function with 1 method)

Of course, we can implement it in one shot as follows:

function encodeLZ77b(input_text, window_size)
    binary_code = []  # initialize the window
    pointer = 1;
    while pointer <= length(input_text)
        if pointer > window_size
            window = input_text[pointer-window_size:pointer-1]
        else
            window = input_text[1:pointer-1]
        end
        temp = matchPositionLength(window, input_text[pointer:end])
        if temp[1] == 1
            flag, pos, len = temp
            output = join([string(flag), bin(pos, Int(ceil(log2(window_size)))), lengthEncoding(len)])
            pointer += len
        else
            output = join([string(0), char2ascii(input_text[pointer])])
            pointer += 1
        end
        append!(binary_code, [output])
#         print(pointer, " ", output,"\n")  # for debugging
    end
    return join(binary_code)
end

encodeLZ77b (generic function with 1 method)

Examples¶

encodeLZ77a("MY MY WHAT A HAT IS THAT", 16)

"0100110101011001001000001001011100101011101001000010000010101010010100101001010100011010101110001010010010101001110010101010010101111110"

encodeLZ77b("MY MY WHAT A HAT IS THAT", 16)

"0100110101011001001000001001011100101011101001000010000010101010010100101001010100011010101110001010010010101001110010101010010101111110"

(e) Practice¶

Load the file $\texttt{sample}\_\texttt{text.txt}$ into a string $\texttt{s}.$ For example, in MATLAB, this can be done using the following code.¶

$\texttt{FID = fopen(‘sample}\_\texttt{text.txt’, ‘rb’);}$¶

$\texttt{s = fread(FID, [1, inf], ‘char’);}$¶

$\texttt{fclose(FID);}$¶

Use the function $\texttt{SWLZ}$ to compress s using the sliding-window Lempel–Ziv algorithm, and find the compression ratios (in number of bits per symbol) for window sizes $\texttt{256},$ $\texttt{512},$ $\texttt{1024},$ $\texttt{2048},$ and $\texttt{4096}.$ What do you observe? (You do not have to show the full binary encoding of the text.)¶

f = open("sample_text.txt")
str = readstring(f)
close(f)

isascii(str)

true

print("We are using 7-bits ascii.\n-----------------------------------\n")
print("The input text has ", length(str), " characters.\n")
for window_size in [256, 512, 1024, 2048, 4096]
    code = encodeLZ77a(str, window_size)
    print("For a window size of ", window_size, ", the compressed length is ", length(code), " bits and the compression ratio is ", length(code)/length(str), " bits per symbol, or ", length(code)/length(str)/7*100, "%.\n")
end

We are using 7-bits ascii.
-----------------------------------
The input text has 6281 characters.
For a window size of 256, the compressed length is 38103 bits and the compression ratio is 6.066390702117498 bits per symbol, or 86.66272431596425%.
For a window size of 512, the compressed length is 36452 bits and the compression ratio is 5.803534469033593 bits per symbol, or 82.90763527190848%.
For a window size of 1024, the compressed length is 35408 bits and the compression ratio is 5.637318898264607 bits per symbol, or 80.53312711806582%.
For a window size of 2048, the compressed length is 34026 bits and the compression ratio is 5.417290240407579 bits per symbol, or 77.38986057725113%.
For a window size of 4096, the compressed length is 34269 bits and the compression ratio is 5.455978347396911 bits per symbol, or 77.94254781995588%.

We see that the compression ratio decreases with increase in window size till 2048, after which it increases. So, choosing an appropriate window size involves a tradeoff.