Lab 15 handouts
From SeedWiki
Student Name ___________________________ Student ID ___________________________________
Cumulative Log Loss - Two Pass
In the last lab, we worked on coding a sequence where each character was generated IID. For this lab, we consider the case where we assume that knowing what the current character is will help in predicting the next character in the sequence. For example if the current character is "t", then the probability that the next character is "h" is quite high. This model can be used to further compress the data.
Consider the text used for the home work 7. Find the conditional probability distribution P(Next Char | Current Char). The code to find the conditional probability distribution is here. Your task is to find the cumulative log loss which will indicate the number of bits required to code the sequence.
Cumulative log loss is calculated by computing the log loss for each character and then summing it up. For example, suppose current character is "a". We already have found the probability distribution of the next character. If the next character is "b", we suffer a loss of log(P(Next Char =b | Current Char = a)). We do this for the whole sequence and sum the log loss suffered to get the cumulative log loss.
Cumulative Log Loss - Add 1/2 rule
In the above case, the compression was done in 2 steps. First we go through the data to find the conditional distribution and after that we compress the data using the conditional distribution. Here we will find the cumulative log loss using "add 1/2" rule.
In "add 1/2" rule, we find the conditional probability online as follows:
P(next char = "b" | current char = "a" ) = [(#pair "ab" has occurred till now) + 1/2]/[(#a has occurred till now)+128/2]
Find the cumulative log loss when prediction is done based on "add 1/2" rule.