ECE 225B: Universal Probability and Applications @UCSD¶

http://circuit.ucsd.edu/~yhk/ece225b-spr18/

Homework 2: Mixture probability and context tree weighting¶

  • Due: May 22, 2018, 12:00 AM

  • Late homework will not be accepted.

  • The homework will be collected through Gradescope. Please follow the submission instructions on Piazza.

  • Please read the following problems carefully, write your code as clearly as possible, and add suitable comments.

  • NAME:

  • PID:

In this homework, we will implement mixture probability assignments such as the Krichevsky-Trofimov (KT) and Laplace mixtures. Then, we will implement context tree weighting (CTW) using the KT probability assignment as a building block.

[1] Willems, Frans M. J., Yuri M. Shtarkov, and Tjalling J. Tjalkens. "The context-tree weighting method: basic properties." IEEE Transactions on Information Theory 41.3 (1995): 653-664.

[2] Willems, Frans M. J., Tjalling J. Tjalkens, and Tanya Ignatenko. "Context-tree weighting and maximizing: Processing betas." Proceedings of the Information Theory and Applications (ITA) Workshop, 2006.

In [2]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
In [3]:
# You may find this function useful
def chars_to_dict(chars):
    """Map each distinct character to an index, e.g., '01' -> {'0': 0, '1': 1}."""
    chars = sorted(set(chars))
    char_to_index = {x: i for i, x in enumerate(chars)}
    return char_to_index

Problem 1. (Mixture of i.i.d. probabilities)¶

In this problem, we implement a sequential probability assignment based on the mixture of i.i.d. multinomial distributions with respect to the prior $\text{Dirichlet}(\alpha,\ldots,\alpha)$. Recall that the sequential probability assignment for this mixture simply reduces to the add-$\alpha$ estimator. For example, if $\alpha=1/2$, it becomes the KT mixture, and if $\alpha=1$, it becomes the Laplace mixture.

Write a function

prob_mixture_iid_seq(string, char_to_index, alpha=1/2)

which takes string, char_to_index, and alpha as inputs, and outputs the sequential distribution dist_array. Here, alpha is a float, the smoothing parameter. The output dist_array is an array with len(string)+1 rows, whose row $t$ is the conditional distribution of the next symbol given the first $t$ symbols, for $t=0,1,\ldots,n$.

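For reference, here is a minimal sketch of one possible implementation of the add-$\alpha$ update (the helper name _add_alpha_sketch is ours; your fill-in below may of course be organized differently):

In [ ]:
# A minimal sketch, not the official solution: maintain running symbol counts
# and emit q(a | x^t) = (N_t(a) + alpha) / (t + m * alpha) at each step.
def _add_alpha_sketch(string, char_to_index, alpha=1/2):
    m = len(char_to_index)
    counts = np.zeros(m)
    dist_array = np.zeros((len(string) + 1, m))
    for t, ch in enumerate(string):
        dist_array[t] = (counts + alpha) / (t + m * alpha)
        counts[char_to_index[ch]] += 1
    # The last row conditions on the entire string.
    dist_array[-1] = (counts + alpha) / (len(string) + m * alpha)
    return dist_array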
In [777]:
def prob_mixture_iid_seq(string, char_to_index, alpha=1/2):
    ### FILL IN THE CODE ###

    return dist_array
In [778]:
A = '10111111'
B = '0010110111'
C = '00121212102101210'
D = 'AGTTTTCGTAACGTT'
In [769]:
prob_mixture_iid_seq(A, chars_to_dict('01'), alpha=1/2)
Out[769]:
array([[0.5       , 0.5       ],
       [0.25      , 0.75      ],
       [0.5       , 0.5       ],
       [0.375     , 0.625     ],
       [0.3       , 0.7       ],
       [0.25      , 0.75      ],
       [0.21428571, 0.78571429],
       [0.1875    , 0.8125    ],
       [0.16666667, 0.83333333]])
In [770]:
prob_mixture_iid_seq(B, chars_to_dict('01'), alpha=1)
Out[770]:
array([[0.5       , 0.5       ],
       [0.66666667, 0.33333333],
       [0.75      , 0.25      ],
       [0.6       , 0.4       ],
       [0.66666667, 0.33333333],
       [0.57142857, 0.42857143],
       [0.5       , 0.5       ],
       [0.55555556, 0.44444444],
       [0.5       , 0.5       ],
       [0.45454545, 0.54545455],
       [0.41666667, 0.58333333]])
In [759]:
prob_mixture_iid_seq(B, chars_to_dict('012'), alpha=1/2)
Out[759]:
array([[0.33333333, 0.33333333, 0.33333333],
       [0.6       , 0.2       , 0.2       ],
       [0.71428571, 0.14285714, 0.14285714],
       [0.55555556, 0.33333333, 0.11111111],
       [0.63636364, 0.27272727, 0.09090909],
       [0.53846154, 0.38461538, 0.07692308],
       [0.46666667, 0.46666667, 0.06666667],
       [0.52941176, 0.41176471, 0.05882353],
       [0.47368421, 0.47368421, 0.05263158],
       [0.42857143, 0.52380952, 0.04761905],
       [0.39130435, 0.56521739, 0.04347826]])
In [760]:
prob_mixture_iid_seq(C, chars_to_dict('012'))
Out[760]:
array([[0.33333333, 0.33333333, 0.33333333],
       [0.6       , 0.2       , 0.2       ],
       [0.71428571, 0.14285714, 0.14285714],
       [0.55555556, 0.33333333, 0.11111111],
       [0.45454545, 0.27272727, 0.27272727],
       [0.38461538, 0.38461538, 0.23076923],
       [0.33333333, 0.33333333, 0.33333333],
       [0.29411765, 0.41176471, 0.29411765],
       [0.26315789, 0.36842105, 0.36842105],
       [0.23809524, 0.42857143, 0.33333333],
       [0.30434783, 0.39130435, 0.30434783],
       [0.28      , 0.36      , 0.36      ],
       [0.25925926, 0.40740741, 0.33333333],
       [0.31034483, 0.37931034, 0.31034483],
       [0.29032258, 0.41935484, 0.29032258],
       [0.27272727, 0.39393939, 0.33333333],
       [0.25714286, 0.42857143, 0.31428571],
       [0.2972973 , 0.40540541, 0.2972973 ]])
In [771]:
prob_mixture_iid_seq(D, chars_to_dict('AGTC'))
Out[771]:
array([[0.25      , 0.25      , 0.25      , 0.25      ],
       [0.5       , 0.16666667, 0.16666667, 0.16666667],
       [0.375     , 0.125     , 0.375     , 0.125     ],
       [0.3       , 0.1       , 0.3       , 0.3       ],
       [0.25      , 0.08333333, 0.25      , 0.41666667],
       [0.21428571, 0.07142857, 0.21428571, 0.5       ],
       [0.1875    , 0.0625    , 0.1875    , 0.5625    ],
       [0.16666667, 0.16666667, 0.16666667, 0.5       ],
       [0.15      , 0.15      , 0.25      , 0.45      ],
       [0.13636364, 0.13636364, 0.22727273, 0.5       ],
       [0.20833333, 0.125     , 0.20833333, 0.45833333],
       [0.26923077, 0.11538462, 0.19230769, 0.42307692],
       [0.25      , 0.17857143, 0.17857143, 0.39285714],
       [0.23333333, 0.16666667, 0.23333333, 0.36666667],
       [0.21875   , 0.15625   , 0.21875   , 0.40625   ],
       [0.20588235, 0.14705882, 0.20588235, 0.44117647]])

Problem 2. (Mixture of Markov probabilities)¶

Now we implement a sequential probability assignment based on the mixture of $k$-th order Markov probabilities. More precisely, as we discussed in class, given an initial state $s_1=x_{-k+1}^0$, we assign the following probability to a given string $x_1^n$: $$ q_{k,\alpha}(x^n|s_1) = \prod_{s\in \mathcal{X}^k}q_{\alpha}^s(x^n|s_1), $$ where $q_{\alpha}^s(x^n|s_1)$ is the add-$\alpha$ probability assignment for the subsequence of $x^n$ whose state is $s\in\mathcal{X}^k$. In words, given a Markov order $k$, we decompose the given sequence into subsequences according to the Markov states of length $k$, and assign a simple mixture probability to each subsequence. For simplicity, let us take $s_1$ to be $k$ consecutive 0's, where the alphabet is $\{0,1,\ldots,M-1\}$, and find the probability $q_{k,\alpha}(x^n|s_1)$.

Write a function

prob_mixture_markov_seq(string, char_to_index, alpha=1/2, k=0)

which takes string, char_to_index, alpha, and k as inputs, and outputs the sequential distribution dist_array of the add-$\alpha$ probability of the input string $x^n$ with respect to the $k$-th order Markov model. Here, k is an int, the Markov order. (The stub below also accepts an optional init_state, which defaults to the all-zero initial state.)

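One way to realize this decomposition is to keep a separate add-$\alpha$ count vector per length-$k$ context. A minimal sketch follows (the helper name _markov_mixture_sketch and the dictionary-of-contexts layout are our choices, not the required ones):

In [ ]:
# A sketch, not the official solution: one add-alpha estimator per context.
def _markov_mixture_sketch(string, char_to_index, alpha=1/2, k=0):
    m = len(char_to_index)
    x = [char_to_index[ch] for ch in string]
    counts = {}              # context tuple -> vector of symbol counts
    state = (0,) * k         # initial state s_1: k consecutive 0's
    dist_array = np.zeros((len(x) + 1, m))
    for t in range(len(x) + 1):
        c = counts.setdefault(state, np.zeros(m))
        dist_array[t] = (c + alpha) / (c.sum() + m * alpha)
        if t < len(x):
            c[x[t]] += 1
            if k > 0:
                state = (state + (x[t],))[-k:]   # slide the Markov window
    return dist_array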
In [790]:
def prob_mixture_markov_seq(string, char_to_index, alpha=1/2, k=0, init_state=None):
    ### FILL IN THE CODE ###
    
    return dist_array
In [791]:
A, prob_mixture_markov_seq(A, chars_to_dict('012'), alpha=1/2, k=1)
Out[791]:
('10111111', array([[0.33333333, 0.33333333, 0.33333333],
        [0.33333333, 0.33333333, 0.33333333],
        [0.2       , 0.6       , 0.2       ],
        [0.6       , 0.2       , 0.2       ],
        [0.42857143, 0.42857143, 0.14285714],
        [0.33333333, 0.55555556, 0.11111111],
        [0.27272727, 0.63636364, 0.09090909],
        [0.23076923, 0.69230769, 0.07692308],
        [0.2       , 0.73333333, 0.06666667]]))
In [793]:
A, prob_mixture_markov_seq(A, chars_to_dict('012'), alpha=1/2, k=2)
Out[793]:
('10111111', array([[0.33333333, 0.33333333, 0.33333333],
        [0.33333333, 0.33333333, 0.33333333],
        [0.33333333, 0.33333333, 0.33333333],
        [0.6       , 0.2       , 0.2       ],
        [0.33333333, 0.33333333, 0.33333333],
        [0.2       , 0.6       , 0.2       ],
        [0.14285714, 0.71428571, 0.14285714],
        [0.11111111, 0.77777778, 0.11111111],
        [0.09090909, 0.81818182, 0.09090909]]))
In [794]:
B, prob_mixture_markov_seq(B, chars_to_dict('01'), alpha=1/2, k=1)
Out[794]:
('0010110111', array([[0.5       , 0.5       ],
        [0.75      , 0.25      ],
        [0.83333333, 0.16666667],
        [0.5       , 0.5       ],
        [0.625     , 0.375     ],
        [0.75      , 0.25      ],
        [0.5       , 0.5       ],
        [0.5       , 0.5       ],
        [0.625     , 0.375     ],
        [0.5       , 0.5       ],
        [0.41666667, 0.58333333]]))
In [795]:
B, prob_mixture_markov_seq(B, chars_to_dict('012'), alpha=1/2, k=2)
Out[795]:
('0010110111', array([[0.33333333, 0.33333333, 0.33333333],
        [0.6       , 0.2       , 0.2       ],
        [0.71428571, 0.14285714, 0.14285714],
        [0.33333333, 0.33333333, 0.33333333],
        [0.33333333, 0.33333333, 0.33333333],
        [0.6       , 0.2       , 0.2       ],
        [0.33333333, 0.33333333, 0.33333333],
        [0.2       , 0.6       , 0.2       ],
        [0.42857143, 0.42857143, 0.14285714],
        [0.6       , 0.2       , 0.2       ],
        [0.42857143, 0.42857143, 0.14285714]]))
In [796]:
C, prob_mixture_markov_seq(C, chars_to_dict('012'), alpha=1/2, k=3)
Out[796]:
('00121212102101210', array([[0.33333333, 0.33333333, 0.33333333],
        [0.6       , 0.2       , 0.2       ],
        [0.71428571, 0.14285714, 0.14285714],
        [0.33333333, 0.33333333, 0.33333333],
        [0.33333333, 0.33333333, 0.33333333],
        [0.33333333, 0.33333333, 0.33333333],
        [0.33333333, 0.33333333, 0.33333333],
        [0.2       , 0.2       , 0.6       ],
        [0.2       , 0.6       , 0.2       ],
        [0.14285714, 0.14285714, 0.71428571],
        [0.33333333, 0.33333333, 0.33333333],
        [0.33333333, 0.33333333, 0.33333333],
        [0.33333333, 0.33333333, 0.33333333],
        [0.2       , 0.2       , 0.6       ],
        [0.33333333, 0.33333333, 0.33333333],
        [0.2       , 0.6       , 0.2       ],
        [0.33333333, 0.11111111, 0.55555556],
        [0.14285714, 0.42857143, 0.42857143]]))
In [52]:
C, prob_mixture_markov_seq(C, chars_to_dict('012'), alpha=1/2, k=5)
Out[52]:
('00121212102101210', array([[0.33333333, 0.33333333, 0.33333333],
        [0.6       , 0.2       , 0.2       ],
        [0.71428571, 0.14285714, 0.14285714],
        [0.33333333, 0.33333333, 0.33333333],
        [0.33333333, 0.33333333, 0.33333333],
        [0.33333333, 0.33333333, 0.33333333],
        [0.33333333, 0.33333333, 0.33333333],
        [0.33333333, 0.33333333, 0.33333333],
        [0.33333333, 0.33333333, 0.33333333],
        [0.2       , 0.2       , 0.6       ],
        [0.33333333, 0.33333333, 0.33333333],
        [0.33333333, 0.33333333, 0.33333333],
        [0.33333333, 0.33333333, 0.33333333],
        [0.33333333, 0.33333333, 0.33333333],
        [0.33333333, 0.33333333, 0.33333333],
        [0.33333333, 0.33333333, 0.33333333],
        [0.33333333, 0.33333333, 0.33333333],
        [0.33333333, 0.33333333, 0.33333333]]))

Problem 3. (Context tree weighting)¶

We implement the context tree weighting (CTW) algorithm in this problem. Recall that, for a maximum context depth $D$, $$ q^s_{\text{CTW}}(x^n)= \begin{cases} \displaystyle\frac{1}{2}q^s_{\text{KT}}(x^n) +\frac{1}{2}\prod_{y\in \mathcal{X}}q^{ys}_{\text{CTW}}(x^n) &\text{if }|s|<D\\ \displaystyle q^s_{\text{KT}}(x^n) &\text{if }|s|=D \end{cases} $$

Write a function

prob_ctw_seq(string, char_to_index, max_depth=5)

that outputs the sequential distribution dist_array produced by CTW with maximum context depth max_depth.

NOTE: How to implement conditional probability efficiently?¶

Note that a naive implementation of the sequential probability assignment \begin{align} q^{\lambda}_{\text{CTW}}(\alpha|x^n) = \frac{q^{\lambda}_{\text{CTW}}(x^n\alpha)}{q^{\lambda}_{\text{CTW}}(x^n)} \end{align} for each $\alpha\in\mathcal{X}$ using the above recursion would be very slow, since the whole recursion would have to be recomputed for every symbol.

The following summary of Section 2 of [2] allows for a much faster implementation.

For a given maximum depth $D$, let us assume that we are given a suffix $s_1=x_{-D+1}^{0}$, and omit the conditioning for simplicity. That is, we write $q^{\lambda}_{\text{CTW}}(\alpha|x^n)$ for $q^{\lambda}_{\text{CTW}}(\alpha|x_{-D+1}^0x_1^n)$.

Algorithm¶

For each $n=0,1,\ldots,N$, we would like to find $q_{\text{CTW}}^{\lambda}(\cdot|x^n)$, where $\lambda$ denotes the empty string. Suppose $|s|<D$. From the recursion formula, we can write

\begin{align} q_{\text{CTW}}^s(\alpha|x^n) &=\frac{q_{\text{CTW}}^s(x^n\alpha)}{q_{\text{CTW}}^s(x^n)}\\ &=\frac{q^s_{\text{KT}}(x^n\alpha)+\prod_{y\in \mathcal{X}}q^{ys}_{\text{CTW}}(x^n\alpha)}{q^s_{\text{KT}}(x^n)+\prod_{y\in \mathcal{X}}q^{ys}_{\text{CTW}}(x^n)}\\ &=\frac{\beta^s(x^n)\,q_{\text{KT}}^s(\alpha|x^n)+q_{\text{CTW}}^{x_{n-|s|}s}(\alpha|x^n)}{\beta^s(x^n)+1}, && (\because q^{ys}_{\text{CTW}}(x^n\alpha)=q^{ys}_{\text{CTW}}(x^n) \text{ if } y\neq x_{n-|s|}) \end{align}

where $$ \beta^s(x^n)=\frac{q_{\text{KT}}^s(x^n)}{\prod_{y\in\mathcal{X}}q_{\text{CTW}}^{ys}(x^n)}. $$

Hence, $q_{\text{CTW}}^{\lambda}(\cdot|x^n)$ can be found recursively as follows: $$ q_{\text{CTW}}^s(\cdot|x^n)= \begin{cases} \displaystyle\frac{\beta^s(x^n)\,q_{\text{KT}}^s(\cdot|x^n)+q_{\text{CTW}}^{x_{n-|s|}s}(\cdot|x^n)}{\beta^s(x^n)+1} & \text{if }|s|<D\\ q_{\text{KT}}^s(\cdot|x^n) & \text{if }|s|=D. \end{cases} $$

After obtaining the conditional distribution, if $n\le N-1$, the ratio $\beta^s(\cdot)$ needs to be updated with the new symbol $x_{n+1}$, but only for the nodes $s$ on the context path of $x_{n+1}$ (i.e., the suffixes $\lambda, x_n, x_{n-1}x_n,\ldots$ of length less than $D$); for every other node $\beta^s$ is unchanged: $$ \beta^s(x^nx_{n+1}) = \beta^s(x^n)\, \frac{q_{\text{KT}}^s(x_{n+1}|x^n)}{q_{\text{CTW}}^{x_{n-|s|}s}(x_{n+1}|x^n)}. $$

Implementation¶

To implement this, use any tree data structure that stores, at each node $s$: the depth (i.e., the length of the suffix $s$), the counts of the symbols observed under the corresponding suffix, and the ratio $\beta^s$.

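To make the recursion concrete, here is a hedged sketch of one possible implementation (the names CTWNode, _kt_cond, and _ctw_sketch are ours): each node stores its symbol counts and its ratio $\beta^s$, and each step walks the context path once, mixes the conditional distributions bottom-up, and then updates the betas and counts along the same path.

In [ ]:
# A sketch, not the official solution; names and layout are illustrative.
class CTWNode:
    def __init__(self, num_abc):
        self.counts = np.zeros(num_abc)  # symbol counts observed under suffix s
        self.beta = 1.0                  # beta^s = q_KT^s / prod_y q_CTW^{ys}
        self.children = {}               # y -> node of the longer suffix ys

def _kt_cond(node, num_abc, alpha=1/2):
    # Add-alpha (KT for alpha = 1/2) conditional distribution at one node.
    return (node.counts + alpha) / (node.counts.sum() + num_abc * alpha)

def _ctw_sketch(string, char_to_index, max_depth=5):
    num_abc = len(char_to_index)
    x = [char_to_index[ch] for ch in string]
    dist_array = np.zeros((len(x) + 1, num_abc))
    root = CTWNode(num_abc)
    for n in range(len(x) + 1):
        # Nodes on the context path lambda, x_n, x_{n-1}x_n, ..., down to
        # depth D; symbols before x_1 are 0's (the all-zero initial state).
        path = [root]
        for d in range(max_depth):
            y = x[n - 1 - d] if n - 1 - d >= 0 else 0
            path.append(path[-1].children.setdefault(y, CTWNode(num_abc)))
        # Mix bottom-up: q_CTW = q_KT at depth D, then the beta recursion.
        q_kt = [_kt_cond(node, num_abc) for node in path]
        q_ctw = [None] * (max_depth + 1)
        q_ctw[max_depth] = q_kt[max_depth]
        for d in range(max_depth - 1, -1, -1):
            b = path[d].beta
            q_ctw[d] = (b * q_kt[d] + q_ctw[d + 1]) / (b + 1)
        dist_array[n] = q_ctw[0]
        if n < len(x):
            a = x[n]
            for d in range(max_depth):   # update betas and counts on the path
                path[d].beta *= q_kt[d][a] / q_ctw[d + 1][a]
                path[d].counts[a] += 1
            path[max_depth].counts[a] += 1
    return dist_array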
In [880]:
def prob_ctw_seq(string, char_to_index, max_depth=5):
    num_abc = len(char_to_index)
    dist_array = np.zeros((len(string)+1, num_abc))
    ### FILL IN THE CODE ###
    
    return dist_array
In [320]:
prob_mixture_iid_seq(A, chars_to_dict('01'), alpha=1/2)
Out[320]:
array([[0.5       , 0.5       ],
       [0.25      , 0.75      ],
       [0.5       , 0.5       ],
       [0.375     , 0.625     ],
       [0.3       , 0.7       ],
       [0.25      , 0.75      ],
       [0.21428571, 0.78571429],
       [0.1875    , 0.8125    ],
       [0.16666667, 0.83333333]])
In [323]:
prob_ctw_seq(A, chars_to_dict('01'), max_depth=0)
Out[323]:
array([[0.5       , 0.5       ],
       [0.25      , 0.75      ],
       [0.5       , 0.5       ],
       [0.375     , 0.625     ],
       [0.3       , 0.7       ],
       [0.25      , 0.75      ],
       [0.21428571, 0.78571429],
       [0.1875    , 0.8125    ],
       [0.16666667, 0.83333333]])
In [324]:
prob_ctw_seq(A, chars_to_dict('01'), max_depth=1)
Out[324]:
array([[0.5       , 0.5       ],
       [0.375     , 0.625     ],
       [0.33333333, 0.66666667],
       [0.65625   , 0.34375   ],
       [0.40909091, 0.59090909],
       [0.30769231, 0.69230769],
       [0.25      , 0.75      ],
       [0.21180556, 0.78819444],
       [0.18428781, 0.81571219]])
In [800]:
prob_ctw_seq(A, chars_to_dict('01'), max_depth=2)
Out[800]:
array([[0.5       , 0.5       ],
       [0.375     , 0.625     ],
       [0.41666667, 0.58333333],
       [0.64285714, 0.35714286],
       [0.4       , 0.6       ],
       [0.27604167, 0.72395833],
       [0.21942446, 0.78057554],
       [0.18317972, 0.81682028],
       [0.15726375, 0.84273625]])
In [801]:
prob_ctw_seq(A, chars_to_dict('01'), max_depth=5)
Out[801]:
array([[0.5       , 0.5       ],
       [0.375     , 0.625     ],
       [0.41666667, 0.58333333],
       [0.59821429, 0.40178571],
       [0.41111111, 0.58888889],
       [0.30896226, 0.69103774],
       [0.23805461, 0.76194539],
       [0.19204927, 0.80795073],
       [0.16016979, 0.83983021]])
In [802]:
prob_ctw_seq(B, chars_to_dict('01'), max_depth=2)
Out[802]:
array([[0.5       , 0.5       ],
       [0.75      , 0.25      ],
       [0.83333333, 0.16666667],
       [0.5625    , 0.4375    ],
       [0.63888889, 0.36111111],
       [0.67307692, 0.32692308],
       [0.5       , 0.5       ],
       [0.47794118, 0.52205882],
       [0.53169014, 0.46830986],
       [0.54135338, 0.45864662],
       [0.42418033, 0.57581967]])
In [803]:
prob_ctw_seq(B, chars_to_dict('01'), max_depth=5)
Out[803]:
array([[0.5       , 0.5       ],
       [0.75      , 0.25      ],
       [0.83333333, 0.16666667],
       [0.5625    , 0.4375    ],
       [0.63888889, 0.36111111],
       [0.63942308, 0.36057692],
       [0.5       , 0.5       ],
       [0.5       , 0.5       ],
       [0.5       , 0.5       ],
       [0.55833333, 0.44166667],
       [0.43207547, 0.56792453]])
In [804]:
prob_ctw_seq(B, chars_to_dict('012'), max_depth=5)
Out[804]:
array([[0.33333333, 0.33333333, 0.33333333],
       [0.6       , 0.2       , 0.2       ],
       [0.71428571, 0.14285714, 0.14285714],
       [0.44444444, 0.33333333, 0.22222222],
       [0.56439394, 0.29545455, 0.14015152],
       [0.53629191, 0.32061144, 0.14309665],
       [0.43818019, 0.43818019, 0.12363961],
       [0.48503079, 0.42460566, 0.09036355],
       [0.46390453, 0.45094078, 0.08515469],
       [0.45137422, 0.47190338, 0.0767224 ],
       [0.38970028, 0.54998224, 0.06031749]])
In [871]:
prob_ctw_seq(D, chars_to_dict('AGTC'), max_depth=1)
Out[871]:
array([[0.25      , 0.25      , 0.25      , 0.25      ],
       [0.5       , 0.16666667, 0.16666667, 0.16666667],
       [0.3125    , 0.1875    , 0.3125    , 0.1875    ],
       [0.26666667, 0.2       , 0.26666667, 0.26666667],
       [0.19791667, 0.13541667, 0.19791667, 0.46875   ],
       [0.1547619 , 0.10714286, 0.1547619 , 0.58333333],
       [0.125     , 0.08928571, 0.125     , 0.66071429],
       [0.23333333, 0.23333333, 0.23333333, 0.3       ],
       [0.16428571, 0.16428571, 0.17857143, 0.49285714],
       [0.09025033, 0.23517787, 0.10210804, 0.57246377],
       [0.34215328, 0.125     , 0.34215328, 0.19069343],
       [0.47230769, 0.10184615, 0.28707692, 0.13876923],
       [0.17799597, 0.16828514, 0.45630125, 0.19741764],
       [0.13076377, 0.12721684, 0.13076377, 0.61125562],
       [0.21442819, 0.21243351, 0.07613032, 0.49700798],
       [0.18797954, 0.18644501, 0.06624041, 0.55933504]])

Problem 4. (Minimax mean universality of mixtures)¶

In this problem, we will verify universality and compare the speed of convergence of the mixture strategies. As in Homework #1, we would like to verify that $$ \lim_{n\to\infty}\frac{1}{n}\mathbb{E}_{X^n\sim \mathsf{P}}\left[\log_2\frac{p(X^n)}{q(X^n)}\right]=0 $$ for every distribution $\mathsf{P}$ in the class.

To verify this, we draw $N$ independent sequences $X^n(i)\sim \mathsf{P}$ for $i=1,\ldots,N$ to obtain $$ \frac{1}{N}\sum_{i=1}^N \log_2\frac{p(X^n(i))}{q(X^n(i))}\approx \mathbb{E}_{X^n\sim \mathsf{P}}\left[\log_2\frac{p(X^n)}{q(X^n)}\right]. $$ We will consider the following two simple probability model classes.

a) $\mathcal{P_a}$={a Bernoulli($p$) i.i.d. process $X_i\sim \text{Bern}(p)$: $p\in[0,1]$}

b) $\mathcal{P_b}$={a binary symmetric Markov chain $X^n$ of transition matrix $$ P=\begin{bmatrix} 1-p & p\\ p & 1-p \end{bmatrix}, $$ starting from the initial distribution $\pi=[1/2,1/2]$: $p\in[0,1]$}

Problem 4(a).¶

For the class of Bernoulli processes $\mathcal{P_a}$, find the approximate mean minimax redundancy $$ \frac{1}{n}\max_{p\in \{0.02,0.04,\ldots,0.5\}}\frac{1}{N}\sum_{i=1}^N \log_2\frac{p(X_{(i)}^n)}{q(X_{(i)}^n)} $$ for the KT and Laplace mixtures of $k$-th order Markov probabilities ($k\in\{0,1,2\}$), the LZ probability assignment, and CTW with max_depth $\in\{1,2\}$, for $n\in \{10, 50, 100, 500, 1000, 5000\}$, with $N=50$ fixed.

Plot $\log_{10} n$ versus $\log_{10}$ of the approximate mean minimax redundancy for each probability assignment, all in the same plot. There should be 9 curves in the plot. Which probability assignment gives the fastest convergence rate?

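As a hedged starting point, the Monte Carlo estimate for a single $(n,p)$ pair might look like the sketch below; it assumes the prob_mixture_iid_seq from Problem 1, and the helper name _redundancy_bern is ours. Repeat it over the $p$ grid (taking the maximum) and over the probability assignments to produce the curves.

In [ ]:
# A sketch for one (n, p) pair and the i.i.d. add-alpha mixture.
def _redundancy_bern(n, p, N=50, alpha=1/2):
    char_to_index = chars_to_dict('01')
    total = 0.0
    for _ in range(N):
        bits = np.random.binomial(1, p, size=n)
        string = ''.join(str(b) for b in bits)
        dist_array = prob_mixture_iid_seq(string, char_to_index, alpha=alpha)
        # log2 q(X^n): sum of the realized log conditional probabilities.
        log_q = sum(np.log2(dist_array[t][b]) for t, b in enumerate(bits))
        k = bits.sum()
        log_p = k * np.log2(p) + (n - k) * np.log2(1 - p)  # true Bern(p) log-prob
        total += log_p - log_q
    return total / (N * n)   # per-symbol average redundancy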
In [20]:
### FILL IN THE CODE ###

Problem 4(b).¶

For each $n\in \{500, 1000, 5000\}$, find the approximate mean minimax redundancy of the add-$\beta$ mixture for each $\beta\in\{0.1,0.2,\ldots, 2.9,3\}$, and plot the values against $\beta$. There should be 3 curves in the same plot. Which $\beta$ achieves the minimum mean minimax redundancy?

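Reusing the hypothetical _redundancy_bern helper sketched in Problem 4(a), and assuming the i.i.d. ($k=0$) add-$\beta$ mixture is intended, the sweep might be organized as follows:

In [ ]:
# A sketch of the beta sweep; assumes the _redundancy_bern helper from 4(a).
betas = np.arange(0.1, 3.01, 0.1)
p_grid = np.arange(0.02, 0.51, 0.02)
for n in [500, 1000, 5000]:
    red = [max(_redundancy_bern(n, p, N=50, alpha=b) for p in p_grid)
           for b in betas]
    plt.plot(betas, red, label='n = {}'.format(n))
plt.xlabel(r'$\beta$')
plt.ylabel('approximate mean minimax redundancy')
plt.legend();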
In [ ]:
### FILL IN THE CODE ###

Problem 4(c).¶

For the class of binary symmetric Markov processes $\mathcal{P_b}$, repeat the experiment of Problem 4(a).

In [ ]:
### FILL IN THE CODE ###

Problem 5. (Universal compression with CTW)¶

In this problem, we compress the same data as in Homework #1.

  • For each dataset, compare the compression ratio with the number from Homework #1. Does CTW compress better than LZ?
  • For max_depth, use 5 or 10. Considering time complexity, what would be a good choice in practice?
In [733]:
from arithmetic_enc_public import *
In [734]:
def compress_with_CTW(filename, char_to_index=None, max_depth=5, verbose=False):
    # Read the file and, if no alphabet is given, build one from its contents.
    with open(filename, 'r', encoding="utf8") as file:
        string = file.read()
    if char_to_index is None:
        char_to_index = chars_to_dict(string)

    dist_array = prob_ctw_seq(string, char_to_index, max_depth)
    encoded_bits = arithmetic_enc_offline(string, char_to_index, dist_array)

    # Compute the ratio unconditionally so the return value is always defined.
    compress_ratio = len(encoded_bits) / (len(string) * np.log2(len(char_to_index)))
    if verbose:
        print("The compression ratio is {}%.".format(compress_ratio * 100))

    return compress_ratio

Problem 5(a). Genome sequence¶

In [ ]:
%time compress_with_CTW('data/CElegans_sample.txt', chars_to_dict('AGTC'), max_depth=5, verbose=True);

Problem 5(b). The Little Prince¶

In [747]:
nums = {'0', '1', '2', '3', '4', '5', '6', '7', '8', '9'} 
en_abc = {' ', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 
          'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z'}
fr_letters = {'é', 
              'à', 'è', 'ù',
              'â', 'ê', 'î', 'ô', 'û',
              'ç',
              'ë', 'ï', 'ü'}
In [ ]:
filename_list = ['data/sample_the_little_prince_en.txt',
                 'data/sample_the_little_prince_fr.txt']
alphabet_list = [en_abc, en_abc|fr_letters]
In [ ]:
for max_depth in [5,10]:
    print("# For max_depth={}".format(max_depth))
    for filename, alphabet in zip(filename_list, alphabet_list):
        print("\tFilename '{}': ".format(filename), end = '')
        %time compress_with_CTW(filename, chars_to_dict(alphabet),\
                                max_depth=max_depth, verbose=True);