Anyone know arithmetic coding?

matt71 · Sep 5, 2010

So I'm having trouble understanding integer arithmetic coding. I was hoping someone on this board would be able to explain it to me in layman's terms (or pseudocode). The part I'm having particular trouble with is decoding the message without knowing the original frequencies of each character. Is there another way to correct the remainder when decoding without using these frequencies?

The example I'm referring to can be found on Wikipedia here:

Code:

http://en.wikipedia.org/wiki/Arithmetic_coding

Any guidance will go a long way!

Thanks :mrgreen:

Denis · Sep 6, 2010

Please post an actual problem; don't think anyone is interested in reading all that!!

matt71 · Sep 6, 2010

Arithmetic coding can be interpreted as a generalized change of radix. So we look at the following message sequence:

Each symbol in this message can be represented as a number in a certain base (A=0, B=1, C=2, D=3, etc).
If we make a table of frequencies and cumulative frequencies for this, it looks like the following:

Cumulative frequency is the total of a frequency and all frequencies below it in a frequency distribution.
It is the 'running total' of frequencies. So it is 1 for A, 3 for B, and 6 for D.

In a positional numeral system the radix, or base, is numerically equal to a number of different symbols used to express the number. The radix is used to express any finite integer in a presumed multiplier in polynomial form. If we choose radix=6 (equal to the size of the message) and convert the message into a decimal number, we first map the letters into digits 301331 like this:

The result 23671 has a length of 15 bits is not close to the theoretical limit computed via entropy, which must be near 9 bits:

We have to compute LOW and HIGH limits and choose a convenient number between them. For the computation of the LOW limit we multiply each next term in the above expression by the product of the frequencies of all previously occurred symbols, so it is turned into the following expression:

The HIGH limit must be the LOW limit plus the product of all frequencies:

Now we can choose any number to represent the message from the semi-closed interval [LOW, HIGH), which we can take as a number with the longest possible trail of zeros, for example 25100. In case of a long message this trail of zeros will be much longer and can be either dropped or presented as an exponent. The number 251, received after truncation of zeros, has a length of 8 bits, which is even less than the theoretical limit. Here is the function I use to quickly get the ideal median number:

Code:

function midzero$(a,b)
    a$=str$(a)
    b$=str$(b)
    if b-a<=1 then
        select case
            case right$(a$,1)="0":midzero$=a$:exit function
            case right$(b$,1)="0":midzero$=b$:exit function
            case else:midzero$=a$
        end select
    end if
    la=len(a$)
    z$=repeat$("0",la)
    for x1=1 to la
        ba$=left$(a$,x1):ba=val(ba$)
        for x2=1 to 9-val(right$(ba$,1))
            h=val(str$(ba+x2)+mid$(z$,x1+1))
            if h>a and h<b then midzero$=str$(h):exit function
            if h>b then exit for
        next x2
    next x1
end function

So now we have 251 (shortened from 25100), and I'd like to be able to convert this number back to the original message sequence:

The reverse process has two logical steps:

identification of the symbol[/*:m:gg147xit]
subtraction of the corresponding term from the result[/*:m:gg147xit]

In identification we divide the result by the correspondent power of 6 and the fractional part of the division is discarded. The result is then matched against the cumulative intervals and the appropriate symbol is selected from look up table. When the symbol is identified, the result is corrected. The process is continued for the known length of the message or while the remaining result is positive. This is in exact accordance with our intervals that are determined by the frequencies. When all intervals equal to 1 we have a special case of classic base change, but the computational part is the same.

The part I'm having trouble with is correcting the remainder. If I intend to use arithmetic coding to compress data, it would not make any sense (as far as efficiency is concerned) to require both the frequency set (3 1 2 3 3) and arithcoded number (25100) in order to decode. Retaining the original message length and the encoded number (25100) is not a problem; it's keeping that frequency information as well that gets me frustrated. Is there any way to correct the remainder without this data?

In case these are needed:

Anyone know arithmetic coding?

matt71

New member

Denis

Senior Member

matt71

New member