Arithmetic coding can be interpreted as a generalized change of radix. So we look at the following message sequence:
Each symbol in this message can be represented as a number in a certain base (A=0, B=1, C=2, D=3, etc).
If we make a table of frequencies and cumulative frequencies for this, it looks like the following:
Cumulative frequency is the total of a frequency and all frequencies below it in a frequency distribution.
It is the 'running total' of frequencies. So it is 1 for A, 3 for B, and 6 for D.
In a positional numeral system the radix, or base, is numerically equal to a number of different symbols used to express the number. The radix is used to express any finite integer in a presumed multiplier in polynomial form. If we choose radix=6 (equal to the size of the message) and convert the message into a decimal number, we first map the letters into digits 301331 like this:
The result 23671 has a length of 15 bits is not close to the theoretical limit computed via entropy, which must be near 9 bits:
We have to compute LOW and HIGH limits and choose a convenient number between them. For the computation of the LOW limit we multiply each next term in the above expression by the product of the frequencies of all previously occurred symbols, so it is turned into the following expression:
The HIGH limit must be the LOW limit plus the product of all frequencies:
Now we can choose any number to represent the message from the semi-closed interval [LOW, HIGH), which we can take as a number with the longest possible trail of zeros, for example 25100. In case of a long message this trail of zeros will be much longer and can be either dropped or presented as an exponent. The number 251, received after truncation of zeros, has a length of 8 bits, which is even less than the theoretical limit. Here is the function I use to quickly get the ideal median number:
Code:
function midzero$(a,b)
a$=str$(a)
b$=str$(b)
if b-a<=1 then
select case
case right$(a$,1)="0":midzero$=a$:exit function
case right$(b$,1)="0":midzero$=b$:exit function
case else:midzero$=a$
end select
end if
la=len(a$)
z$=repeat$("0",la)
for x1=1 to la
ba$=left$(a$,x1):ba=val(ba$)
for x2=1 to 9-val(right$(ba$,1))
h=val(str$(ba+x2)+mid$(z$,x1+1))
if h>a and h<b then midzero$=str$(h):exit function
if h>b then exit for
next x2
next x1
end function
So now we have 251 (shortened from 25100), and I'd like to be able to convert this number back to the original message sequence:
The reverse process has two logical steps:
- identification of the symbol[/*:m:gg147xit]
- subtraction of the corresponding term from the result[/*:m:gg147xit]
In identification we divide the result by the correspondent power of 6 and the fractional part of the division is discarded. The result is then matched against the cumulative intervals and the appropriate symbol is selected from look up table. When the symbol is identified, the result is corrected. The process is continued for the known length of the message or while the remaining result is positive. This is in exact accordance with our intervals that are determined by the frequencies. When all intervals equal to 1 we have a special case of classic base change, but the computational part is the same.
The part I'm having trouble with is correcting the remainder. If I intend to use arithmetic coding to compress data, it would not make any sense (as far as efficiency is concerned) to require both the frequency set (3 1 2 3 3) and arithcoded number (25100) in order to decode. Retaining the original message length and the encoded number (25100) is not a problem; it's keeping that frequency information as well that gets me frustrated. Is there any way to correct the remainder without this data?
In case these are needed: