+===========================================================+
|     Introduction to the lossless compression schemes      |
|           Description of the codec source codes           |
+-----------------------------------------------------------+
| From David Bourgin (E-mail: david.bourgin@ufrima.imag.fr) |
| Date: 22/9/94                                             |
+===========================================================+

                         ------ TAKE CARE ------
This file (compress.txt) is copyrighted. (c) David Bourgin - 1994
Permission to use this documentation for any purpose other than
its incorporation into a commercial product is hereby granted without fee.
Permission to copy and distribute this documentation only for non-commercial use
is also granted without fee, provided, however, that the above copyright notice
appears in all copies, that both that copyright notice and this permission
notice appear in supporting documentation. The author makes no representations
about the suitability of this documentation for any purpose. It is provided
"as is" without express or implied warranty.

The source codes you obtain with this file are *NOT* covered by the same
copyright, because you can include them for both commercial and non-commercial
use. See below for more information.

The source code files (codrle1.c, dcodrle1.c, codrle2.c, dcodrle2.c, codrle3.c,
dcodrle3.c, codrle4.c, dcodrle4.c, codhuff.c, dcodhuff.c) are copyrighted.
They were uploaded by ftp to turing.imag.fr (129.88.31.7):/pub/compression
on 22/5/94 and were modified on 22/9/94.
(c) David Bourgin - 1994
The source codes I provide have no bugs (!) but since I make them
available for free I have some notes to make. They can change at any time
without notice. I assume no responsibility or liability for any errors or
inaccuracies, make no warranty of any kind (express, implied or statutory)
with respect to this publication and expressly disclaim any and all warranties
of merchantability and fitness for particular purposes. Of course, if you have
problems using the information presented here, I will try to help you if
I can.

If you include the source codes in your application, here are the conditions:
- You have to put my name in the header of your source file (not in the
executable program if you don't want to) (this item is a must)
- I would like to see your resulting application, if possible (this item is not
a must, because some applications must remain secret)
- Whenever you earn money with your application, I would like to receive a very
small part in order to be encouraged to update my source codes and to develop
new schemes (this item is not a must)
                          ---------------------

There are several ways to compress data. Here, we are only going to deal with
the lossless schemes. These schemes are also called non-destructive because
you always recover exactly the initial data, whenever you need them.
With lossless schemes, you will never lose any information (except perhaps
when you store or transmit your data but this is another problem...).

In this introduction, we are going to see:
- The RLE scheme (with different possible algorithms)
- The Huffman scheme (dynamic scheme)
- And the LZW scheme

For the novice, a compressor is a program able to read data (e.g. bytes)
as input and to write data as output. The data you obtain from the
output (also called compressed data) will - of course - take less space than
the input data. This is true in most cases, if the compressor works
and if the type of the data is suited to the given scheme.
The codec (coder-decoder) enables you to save space on your hard disk and/or
to save communication costs because you always store/transmit the compressed
data. You'll use the decompressor as soon as you need to recover your initial
useful data. Note that the compressed data are useless if you do not have
the decoder...

You are doubtless asking "How can I reduce the data size without losing any
information?". This question is easy to answer with an example.
I'm sure you have heard about Morse code. This system, established in the 19th
century, uses a scheme very close to the Huffman one. In Morse code you encode
the letters to transmit with two kinds of signs. If you encode these two sign
possibilities in one bit, the symbol 'e' is transmitted in a single bit and
the symbols 'y' and 'z' need four bits. Look at the symbols in the text you are
reading: 'e' is far more frequent than 'y' or 'z', so you'll quickly understand
where the compression ratio comes from...

Important: The source codes associated with the algorithms I present are
completely adaptable to what you need to compress. They all use basic
macros at the top of the file. Usually the macros to change are:

- beginning_of_data
- end_of_data
- read_byte
- read_block
- write_byte
- write_block

These allow the programmer to modify only a small part of the header
of the source codes in order to compress memory as well as files.

beginning_of_data(): Macro used to set the program so that the next read_byte()
call will read the first byte to compress.
end_of_data(): Returns a boolean telling whether there are no more bytes to read
from the input stream. Returns a non-zero value if there are no more bytes to
compress, 0 otherwise.
read_byte(): Returns a byte read from the input stream if available.
write_byte(x): Writes the byte 'x' to the output stream.
read_block(...) and write_block(...): Same use as read_byte() and write_byte(x)
but these macros work on blocks of bytes and not only on a single byte.

If you want to compress *from* memory, before entering a xxxcoding
procedure ('xxx' is to be replaced with the name of a given codec), you
have to set up a pointer to the beginning of the zone to compress. Note
that the pointer 'source_memory_base' below does not have to be added; it is
just given here to name the address of the memory zone you are going to
encode or decode. The same holds for source_memory_end, which can be either
a pointer to create or an existing pointer.

unsigned char *source_memory_base, /* Base of the source memory */
              *source_memory_end,  /* Last address to read.
           source_memory_end=source_memory_base+source_zone_length-1 */
              *source_ptr;         /* Used in the xxxcoding procedure */
void pre_start()
{ source_ptr=source_memory_base;
  xxxcoding();
}

end_of_data() and read_byte() must also be modified to compress *from* memory:

#define end_of_data()  (source_ptr>source_memory_end)
#define read_byte()  (*(source_ptr++))

If you want to compress *to* memory, before entering a xxxcoding procedure
('xxx' is to be replaced with the name of a given codec), you have to add
a pointer. Note that the pointer 'dest_memory_base' does not have to be added;
it is just given here to name the address of the destination memory zone you
are going to encode or decode to.

unsigned char *dest_memory_base, /* Base of the destination memory */
              *dest_ptr;         /* Used in the xxxcoding procedure */
void pre_start()
{ dest_ptr=dest_memory_base;
  xxxcoding();
}

Of course, you can combine both from and to memory in the pre_start() procedure.
The files dest_file and source_file handled in the main() function must then
be removed...

void pre_start()
{ source_ptr=source_memory_base;
  dest_ptr=dest_memory_base;
  xxxcoding();
}

In fact, when writing to memory, the difficulty lies in the write_byte(x) macro,
because your destination zone can either be a static zone or a dynamically
allocated zone. In both cases, you have to check that there is no overflow,
especially if the coder is not efficient and produces more bytes than you
reserved in memory.

In the first case, with a *static* zone, the write_byte(x) macro should look
like this:

unsigned long int dest_zone_length,
                  current_size;

#define write_byte(x)  { if (current_size==dest_zone_length) \
                            exit(1); \
                         dest_ptr[current_size++]=(unsigned char)(x); \
                       }

In the static version, the pre_start() procedure is to be modified as follows:

void pre_start()
{ source_ptr=source_memory_base;
  dest_ptr=dest_memory_base;
  dest_zone_length=...; /* Set up to the actual destination zone length */
  current_size=0; /* Number of written bytes */
  xxxcoding();
}

Otherwise, dest_ptr is a zone created with malloc and you can try
to resize the allocated zone with realloc. Note that I grow
the zone one kilobyte at a time. You need the same two variables:

unsigned long int dest_zone_length,
                  current_size;

#define write_byte(x)  { if (current_size==dest_zone_length) \
                            { dest_zone_length += 1024; \
                              if ((dest_ptr=(unsigned char *)realloc(dest_ptr,dest_zone_length*sizeof(unsigned char)))==NULL) \
                                 exit(1); /* You can't compress in memory \
                                               => I exit but *you* can make a routine to swap on disk */ \
                            } \
                         dest_ptr[current_size++]=(unsigned char)(x); \
                       }

With the dynamically allocated version, change the pre_start() routine as follows:

void pre_start()
{ source_ptr=source_memory_base;
  /* dest_ptr is allocated below, so no dest_memory_base is needed here */
  dest_zone_length=1024;
  if ((dest_ptr=(unsigned char *)malloc(dest_zone_length*sizeof(unsigned char)))==NULL)
     exit(1); /* You need at least 1 kb of dynamic memory! */
  current_size=0; /* Number of written bytes */
  xxxcoding();
  /* Handle the bytes in dest_ptr but don't forget to free these bytes with:
     free(dest_ptr);
  */
}

The macros given previously are used as follows:

void demo()       /* The file opening, closing and variables
                     must be set up by the calling procedure */
{ unsigned char byte;
                  /* And not 'char byte' (!) */
  while (!end_of_data())
        { byte=read_byte();
          printf("Byte read=%c\n",byte);
        }
}
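
As a sketch only, the memory macros above can be put together as follows. The
routine copy_coding() is a hypothetical stand-in for a real xxxcoding()
procedure (it just copies the input), mem_copy_demo() and the 'overflow' flag
are illustrative names that do not come from the codec sources, and the static
destination zone is assumed to be supplied by the caller.

```c
#include <assert.h>
#include <stddef.h>

static const unsigned char *source_ptr, *source_memory_end;
static unsigned char *dest_ptr;
static unsigned long dest_zone_length, current_size;
static int overflow;   /* Hypothetical flag, set instead of calling exit(1) */

#define end_of_data()  (source_ptr>source_memory_end)
#define read_byte()    (*(source_ptr++))
#define write_byte(x)  { if (current_size==dest_zone_length) \
                            overflow=1; \
                         else \
                            dest_ptr[current_size++]=(unsigned char)(x); \
                       }

/* Hypothetical stand-in for xxxcoding(): copies the input to the output */
static void copy_coding(void)
{ unsigned char byte;
  while (!end_of_data())
        { byte=read_byte();
          write_byte(byte);
        }
}

/* Plays the role of pre_start(). Returns the number of bytes written,
   or -1 if the destination zone was too small. src_len must be >= 1. */
long mem_copy_demo(const unsigned char *src, unsigned long src_len,
                   unsigned char *dst, unsigned long dst_len)
{ assert(src!=NULL && dst!=NULL && src_len>0);
  source_ptr=src;
  source_memory_end=src+src_len-1;   /* Last address to read */
  dest_ptr=dst;
  dest_zone_length=dst_len;
  current_size=0;
  overflow=0;
  copy_coding();
  return overflow ? -1 : (long)current_size;
}
```

Calling mem_copy_demo(in,4,out,4) on a 4-byte zone copies the four bytes,
whereas a 2-byte destination zone makes it return -1 instead of writing past
the end of the zone.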

You must not change the rest of the program unless you're really sure and
really need to do it!

+==========================================================+
|                     The RLE encoding                     |
+==========================================================+

RLE is an acronym that stands for Run Length Encoding. You may encounter it
under another acronym: RLC, Run Length Coding.

The idea in this scheme is to recode your data with regard to the repetition
of frames. A frame is one or more bytes that occur one or several times.

There are several ways to encode occurrences. So, you'll have several codecs.
For example, you may have a sequence such as:
0,0,0,0,0,0,255,255,255,2,3,4,2,3,4,5,8,11

Some codecs will only deal with the repetitions of '0' and '255' but others
will deal with the repetitions of '0', '255', and '2,3,4'.

You have to keep in mind something important based on this example. A codec
won't work well on all the data you will try to compress. So, when there are no
repeated sequences, the codecs based on RLE schemes must not
display a message saying "Bye bye". Instead, they will encode these
non-repeated data with a value that says "Sorry, I only make a copy of the
initial input". Of course, a copy of the input data with a header in front of
this copy will make the output bigger than the input, but if you consider the
whole data to compress, the gain on the repeated frames will usually outweigh
the loss on the non-repeated frames.

All of the algorithms known as RLE share the following layout, with three
or four values:
- Value saying whether there is a repetition
- Value saying how many repetitions (or non-repetitions)
- Value of the length of the frame (useless if you only encode frames
with a maximum length of one byte)
- Value of the frame to repeat (or not)

I give four algorithms to illustrate this.

*** First RLE scheme ***

The first scheme is the simplest I know, and looks like the one used on the
Macintosh (PackBits) and in some image file formats such as Targa, PCX, TIFF, ...

Here, all compressed blocks begin with a byte, named header, whose description
is:

Bits   7 6 5 4 3 2 1 0
Header X X X X X X X X

Bit  7:      Compression status (1=Compression applied)
Bits 0 to 6: Number of bytes to handle

So, if bit 7 is set to 0, bits 0 to 6 give the number of bytes
that follow (minus 1, to extend the range) and that were not compressed
(native bytes). If bit 7 is set to 1, the same bits 0 to 6 give
the number of repetitions (minus 2) of the following byte.

As you see, this method only handles frames of one byte.

Additional note: You have 'minus 1' for non-repeated frames because you must
have at least one byte to copy and 'minus 2' for repeated frames because the
repetition count must be at least 2. With 7 bits (0 to 127), a header can thus
announce up to 128 non-repeated bytes or up to 129 repetitions.

Compression scheme:

          First byte=Next byte?
                    /\
                   /  \
Count the byte         Count the occurrence of NON identical
occurrences            bytes (maximum 128 times)
(maximum 129 times)    and store them in an array
        |                        |
        |                        |
  1 bit '1'                 1 bit '0'
+ 7 bits giving           + 7 bits giving
  the number (-2)           the number (-1)
  of repetitions            of non repetition
+ repeated byte           + n non repeated bytes
        |                        |
 1xxxxxxx,yyyyyyyy        0xxxxxxx,n bytes
[-----------------]      [----------------]

Example:

Sequence of bytes to encode | Coded values | Differences with compression
                            |              |         (unit: byte)
-------------------------------------------------------------------------
       255,15,              |  1,255,15,   |            +1
       255,255,             |    128,255,  |             0
        15,15,              |    128,15,   |             0
     255,255,255,           |   129,255,   |            -1
       15,15,15,            |    129,15,   |            -1
   255,255,255,255,         |   130,255,   |            -2
     15,15,15,15            |    130,15    |            -2

(A negative difference means that the coded values take less space than the
original sequence.)

See codecs source codes: codrle1.c and dcodrle1.c
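
To make the header layout concrete, here is a minimal sketch of an encoder for
this first scheme. It is not the code of codrle1.c: the function name
rle1_encode() and the buffer-based interface are assumptions chosen for
illustration, and the destination buffer is assumed to hold at least 2*n bytes,
which is always enough here.

```c
#include <assert.h>
#include <stddef.h>

/* Encodes src[0..n-1] into dst and returns the number of bytes written.
   dst is assumed to hold at least 2*n bytes. */
size_t rle1_encode(const unsigned char *src, size_t n, unsigned char *dst)
{ size_t i = 0, out = 0;
  assert(src != NULL && dst != NULL);
  while (i < n)
        { size_t run = 1;
          /* Count identical bytes, 129 at most (7 bits, minus 2) */
          while (i + run < n && src[i + run] == src[i] && run < 129)
                run++;
          if (run >= 2)
             { dst[out++] = (unsigned char)(0x80 | (run - 2)); /* Bit 7 = 1 */
               dst[out++] = src[i];
               i += run;
             }
          else
             { /* Collect non-repeated bytes, 128 at most (7 bits, minus 1),
                  and stop just before two identical neighbours */
               size_t start = i, len = 0;
               while (i < n && len < 128 &&
                      !(i + 1 < n && src[i + 1] == src[i]))
                     { i++;
                       len++;
                     }
               dst[out++] = (unsigned char)(len - 1);          /* Bit 7 = 0 */
               while (start < i)
                     dst[out++] = src[start++];
             }
        }
  return out;
}
```

Fed with the seven sequences of the table above, one after the other, this
sketch produces the coded values listed in the second column.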

*** Second RLE scheme ***

In the second scheme of RLE compression you look for the least frequent byte
in the source to compress and use it as a header for all compressed blocks.

In the best case, this byte does not occur at all in the data to compress.

There are two possible schemes: the first handles frames of only one byte,
the second handles frames of one byte *and* more. The first case is
the subject of the current compression scheme, the second is the subject
of the next compression scheme.

For frames of one byte, the header byte is written in front of every repetition
of at least 4 bytes. It is then followed by the repetition number minus 1 and
the repeated byte.
Header byte, Occurrence number-1, repeated byte

If a byte does not repeat more than three times, its occurrences are written
without changes in the destination stream (no header, length or repetition
count in front of or after these bytes).

An exception: If the header byte appears in the source one, two, three or more
times, it is respectively encoded as follows:
- Header byte, 0
- Header byte, 1
- Header byte, 2
- Header byte, Occurrence number-1, Header byte

As an example, let's take the previous sequences. A suitable infrequent byte is
the byte 0 (ASCII NUL) because it never appears.

Sequence of bytes to encode | Coded values | Differences with compression
                            |              |         (unit: byte)
-------------------------------------------------------------------------
       255,15,              |    255,15,   |             0
       255,255,             |   255,255,   |             0
        15,15,              |     15,15,   |             0
     255,255,255,           | 255,255,255, |             0
       15,15,15,            |   15,15,15,  |             0
   255,255,255,255,         |   0,3,255,   |            -1
     15,15,15,15            |    0,3,15    |            -1

If the header byte (value 0) appeared in the source, we would see:

Sequence of bytes to encode | Coded values | Differences with compression
                            |              |         (unit: byte)
-------------------------------------------------------------------------
          0,                |      0,0,    |            +1
         255,               |      255,    |             0
         0,0,               |      0,1,    |             0
         15,                |      15,     |             0
        0,0,0,              |      0,2,    |            -1
         255,               |      255,    |             0
       0,0,0,0              |     0,3,0    |            -1

See codecs source codes: codrle2.c and dcodrle2.c
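
The rules of this second scheme can be sketched as below. This is not the code
of codrle2.c: the names rle2_pick_header() and rle2_encode() are hypothetical,
the header byte is passed as a parameter rather than stored in the output (how
the real codec transmits it is not described here), runs are assumed to be
limited to 256 so that 'occurrence number-1' fits in one byte, and dst is
assumed to hold at least 2*n bytes.

```c
#include <assert.h>
#include <stddef.h>

/* First pass: returns the least frequent byte value of src[0..n-1]
   (the lowest value wins when several bytes are equally rare). */
unsigned char rle2_pick_header(const unsigned char *src, size_t n)
{ size_t count[256] = {0}, i;
  unsigned best = 0, v;
  assert(src != NULL);
  for (i = 0; i < n; i++)
      count[src[i]]++;
  for (v = 1; v < 256; v++)
      if (count[v] < count[best])
         best = v;
  return (unsigned char)best;
}

/* Second pass: encodes src[0..n-1] into dst, returns the bytes written. */
size_t rle2_encode(const unsigned char *src, size_t n,
                   unsigned char header, unsigned char *dst)
{ size_t i = 0, out = 0;
  assert(src != NULL && dst != NULL);
  while (i < n)
        { size_t run = 1, k;
          while (i + run < n && src[i + run] == src[i] && run < 256)
                run++;
          if (src[i] == header)
             { /* Header byte in the source: always escaped */
               dst[out++] = header;
               dst[out++] = (unsigned char)(run - 1);
               if (run >= 4)
                  dst[out++] = header;
             }
          else if (run >= 4)
             { dst[out++] = header;
               dst[out++] = (unsigned char)(run - 1);
               dst[out++] = src[i];
             }
          else
             for (k = 0; k < run; k++)   /* 1 to 3 bytes: copied as is */
                 dst[out++] = src[i];
          i += run;
        }
  return out;
}
```

With the header byte 0, the sequences of the two tables above give the coded
values shown in their second columns.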

*** Third RLE scheme ***

It's the same idea as the second scheme but we can encode frames of
more than one byte. So we have three cases:

- If it is the header byte, whatever its number of occurrences, you encode it with:
Header byte,0,number of occurrences-1
- For frames where (repetition-1)*length>3, encode them as:
Header byte, Number of frame repetitions-1, frame length-1, bytes of frame
- If none of the previous cases applies, you write the bytes as they are (no
header, length or repetition count in front of or after these bytes).

Example based on the previous examples:

Sequence of bytes to encode |   Coded values   | Differences with compression
                            |                  |         (unit: byte)
-----------------------------------------------------------------------------
           255,15,          |      255,15,     |             0
           255,255,         |     255,255,     |             0
            15,15,          |       15,15,     |             0
         255,255,255,       |   255,255,255,   |             0
           15,15,15,        |     15,15,15,    |             0
       255,255,255,255,     | 255,255,255,255, |             0
         15,15,15,15,       |   15,15,15,15,   |             0
      16,17,18,16,17,18,    |16,17,18,16,17,18,|             0
     255,255,255,255,255,   |    0,4,0,255,    |            -1
       15,15,15,15,15,      |     0,4,0,15,    |            -1
 16,17,18,16,17,18,16,17,18,|  0,2,2,16,17,18, |            -3
  16,17,18,19,16,17,18,19   |0,1,3,16,17,18,19 |            -1

If the header (value 0) were met, we would see:

Sequence of bytes to encode | Coded values  | Differences with compression
                            |               |         (unit: byte)
--------------------------------------------------------------------------
          0,                |     0,0,0,    |            +2
         255,               |      255,     |             0
         0,0,               |     0,0,1,    |            +1
          15,               |       15,     |             0
        0,0,0,              |     0,0,2,    |             0
         255,               |      255,     |             0
       0,0,0,0              |     0,0,3     |            -1

See codecs source codes: codrle3.c and dcodrle3.c
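
Since a repeated frame always has 'number of frame repetitions-1' of at least
1, a 0 right after the header byte can only announce the header byte itself.
The decoding side of the three cases can therefore be sketched as follows. This
is not the code of dcodrle3.c: the name rle3_decode(), the RLE3_ERROR value and
the buffer-based interface are assumptions made for the illustration.

```c
#include <assert.h>
#include <stddef.h>

#define RLE3_ERROR ((size_t)-1)

/* Decodes src[0..n-1] into dst (capacity dst_size).
   Returns the number of bytes written, or RLE3_ERROR when the input is
   truncated or the output does not fit. */
size_t rle3_decode(const unsigned char *src, size_t n, unsigned char header,
                   unsigned char *dst, size_t dst_size)
{ size_t i = 0, out = 0;
  assert(src != NULL && dst != NULL);
  while (i < n)
        { unsigned char b = src[i++];
          size_t repeat, length, r, k;
          if (b != header)
             { if (out == dst_size) return RLE3_ERROR;
               dst[out++] = b;          /* Plain byte, copied as is */
               continue;
             }
          if (n - i < 2) return RLE3_ERROR;
          repeat = src[i++];
          if (repeat == 0)
             { /* Header byte itself: 0, number of occurrences-1 */
               size_t count = (size_t)src[i++] + 1;
               if (dst_size - out < count) return RLE3_ERROR;
               for (k = 0; k < count; k++)
                   dst[out++] = header;
               continue;
             }
          /* Frame: repetitions-1, frame length-1, bytes of the frame */
          length = (size_t)src[i++] + 1;
          if (n - i < length) return RLE3_ERROR;
          if (dst_size - out < (repeat + 1) * length) return RLE3_ERROR;
          for (r = 0; r <= repeat; r++)
              for (k = 0; k < length; k++)
                  dst[out++] = src[i + k];
          i += length;
        }
  return out;
}
```

For instance, with the header byte 0, the coded values 0,2,2,16,17,18 are
expanded back to 16,17,18,16,17,18,16,17,18 and 0,0,3 to four bytes 0.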

*** Fourth RLE scheme ***

This last RLE algorithm better handles repetitions of any kind (one byte
and more) and non-repetitions, including short ones, and does not
read the source twice as RLE type 3 does.

The compression scheme is:

                  First byte=Next byte?
                           /\
                      Yes /  \ No
                         /    \
                 1 bit '0'     1 bit '1'
                       /        \
                      /          \
       Count the                    Motif of several
       occurrences                  repeated  byte?
       of 1 repeated                ( 65 bytes repeated
       byte (maximum                257 times maxi)
       16449 times)                           /\
            /\                               /  \
           /  \                             /    \
          /    \                           /      \
         /      \                         /        \
  1 bit '0'       1 bit '1'        1 bit '0'          1 bit '1'
+ 6 bits        + 14 bits        + 6 bits of              |
giving the      giving the       the length      Number of non repetition
length (-2)     length (-66)     of the motif         (maximum 8224)
of the          of the           + 8 bits of               /\
repeated byte   repeated byte    the number (-2)     < 33 /  \ > 32
+ repeated byte + repeated byte  of repetition           /    \
    |                |           + bytes of the   1 bit '0'       1 bit '1'
    |                |           motif          + 5 bits of     + 13 bits
    |                |               |          the number (-1) of the
    |                |               |          of non          number (-33)
    |                |               |          repetition      of repetition
    |                |               |          + non           + non
    |                |               |          repeated        repeated
    |                |               |          bytes           bytes
    |                |               |             |               |
    |                |               |             |  111xxxxx,xxxxxxxx,n bytes
    |                |               |             | [-------------------------]
    |                |               |             |
    |                |               |      110xxxxx,n bytes
    |                |               |     [----------------]
    |                |               |
    |                |  10xxxxxx,yyyyyyyy,n bytes
    |                | [-------------------------]
    |                |
    |   01xxxxxx,xxxxxxxx,1 byte
    |  [------------------------]
    |
 00xxxxxx,1 byte
[---------------]

Example, same as previously:

Sequence of bytes to encode |      Coded values       | Differences with compression
                            |                         |         (unit: byte)
------------------------------------------------------------------------------------
       255,15               |    11000001b,255,15,    |             +1
       255,255              |     00000000b,255,      |              0
        15,15               |      00000000b,15,      |              0
     255,255,255            |     00000001b,255,      |             -1
       15,15,15             |      00000001b,15,      |             -1
   255,255,255,255          |     00000010b,255,      |             -2
     15,15,15,15            |      00000010b,15,      |             -2
  16,17,18,16,17,18         |  10000001b,0,16,17,18,  |             -1
 255,255,255,255,255        |     00000011b,255,      |             -3
   15,15,15,15,15           |      00000011b,15,      |             -3
 16,17,18,16,17,18,16,17,18 |  10000001b,1,16,17,18,  |             -4
  16,17,18,19,16,17,18,19   | 10000010b,0,16,17,18,19 |             -2
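
The five block formats of the diagram can be read back with a decoder along the
following lines. This is only a sketch and not the code of dcodrle4.c: the name
rle4_decode() and the RLE4_ERROR value are hypothetical, and the 14-bit and
13-bit counts are assumed to be stored with their high bits in the first byte,
which the diagram suggests but does not state.

```c
#include <assert.h>
#include <stddef.h>

#define RLE4_ERROR ((size_t)-1)

/* Decodes src[0..n-1] into dst (capacity dst_size).
   Returns the number of bytes written, or RLE4_ERROR when the input is
   truncated or the output does not fit. */
size_t rle4_decode(const unsigned char *src, size_t n,
                   unsigned char *dst, size_t dst_size)
{ size_t i = 0, out = 0;
  assert(src != NULL && dst != NULL);
  while (i < n)
        { unsigned char c = src[i++];
          /* Every block is 'length' source bytes written 'repeat' times */
          size_t length = 1, repeat = 1, r, k;
          switch (c >> 6)
          { case 0:                      /* 00xxxxxx,1 byte */
                 repeat = (size_t)(c & 0x3F) + 2;
                 break;
            case 1:                      /* 01xxxxxx,xxxxxxxx,1 byte */
                 if (i == n) return RLE4_ERROR;
                 repeat = (((size_t)(c & 0x3F) << 8) | src[i++]) + 66;
                 break;
            case 2:                      /* 10xxxxxx,yyyyyyyy,n bytes */
                 if (i == n) return RLE4_ERROR;
                 length = (size_t)(c & 0x3F) + 2;
                 repeat = (size_t)src[i++] + 2;
                 break;
            default:                     /* 110xxxxx or 111xxxxx,xxxxxxxx */
                 if ((c & 0x20) == 0)
                    length = (size_t)(c & 0x1F) + 1;
                 else
                    { if (i == n) return RLE4_ERROR;
                      length = (((size_t)(c & 0x1F) << 8) | src[i++]) + 33;
                    }
                 break;
          }
          if (n - i < length || dst_size - out < repeat * length)
             return RLE4_ERROR;
          for (r = 0; r < repeat; r++)
              for (k = 0; k < length; k++)
                  dst[out++] = src[i + k];
          i += length;
        }
  return out;
}
```

Treating every block as "a few source bytes written several times" keeps the
five cases in a single copy loop: a run of one byte has a length of 1 and
non-repeated bytes have a repetition count of 1.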

+==========================================================+
|                   The Huffman encoding                   |
+==========================================================+

This method is named after the researcher who established the algorithm in 1952.
It allows both dynamic and static statistical schemes. A statistical
scheme works on the occurrences of the data. Unlike RLE, where you
considered the current repetition of a frame, you consider here
the global number of occurrences of each datum in the input stream. In this
case, frames can be any kind of sequence you want. Static Huffman
encoding appears in some compressors such as ARJ on PCs. It forces
the encoder to assume the same statistics for all the data you have.
Of course, the results are not as good as with a dynamic encoding.
The static encoding is faster than the dynamic encoding but the dynamic encoding
adapts to the statistics of the bytes of the input stream and
therefore generally produces a shorter output.

The main idea in Huffman encoding is to re-code every byte with regard to its
occurrence. The more frequent bytes in the data to compress will be encoded with
fewer than 8 bits and the others may need 8 bits or even more to be encoded.
You immediately see that the codes associated with the different bytes won't
have identical sizes. The Huffman method actually requires that the binary codes
do not have a fixed size. We then speak about variable length codes.

The dynamic Huffman scheme needs binary trees for the encoding. This
enables you to obtain the best codes, adapted to the source data.
The proof won't be given here. To help the neophyte, I will just explain
what a binary tree is.

A binary tree is a special way to represent data. A binary tree is
a structure in which each value is associated with two pointers. The term binary
comes from the presence of these two pointers. By convention,
one of the pointers is called the left pointer and the other is called the right
pointer. Here is a visual representation of a binary tree.

     Value
      / \
     /   \
 Value    Value
  / \      / \
... ...  ... ...

One problem with a binary encoding is the prefix problem. A prefix is the first
part of the representation of a value, e.g. "h" and "he" are prefixes of "hello"
but "el" is not. To understand the problem, let's code the letters "A", "B", "C",
"D", and "E" respectively as 00b, 01b, 10b, 11b, and 100b. When you read
the binary sequence 00100100b, you are unable to say if this comes from "ACBA"
or "AEE". To avoid such situations, the codes must have the prefix property:
no code may begin with the complete sequence of another code (here, the code
of "E" begins with the code of "C"). With "A", "B", "C", "D", and "E"
respectively assigned 1b, 01b, 001b, 0001b, and 0000b, the sequence 1001011b
can only be decoded as "ACBA".
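
The ambiguity can be checked mechanically by counting in how many ways a bit
sequence splits into code words. The helper below is a hypothetical
illustration (the name count_parsings() and the use of '0'/'1' character
strings are assumptions, and sequences are limited to 64 bits to keep the
sketch short).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Counts the ways to split 'bits' (a string of '0' and '1', at most
   64 characters) into a sequence of the given code words. */
unsigned long count_parsings(const char *codes[], size_t ncodes,
                             const char *bits)
{ size_t len = strlen(bits), i, c;
  unsigned long ways[65];        /* ways[i]: parsings of bits[i..len-1] */
  assert(len < 65);
  ways[len] = 1;                 /* The empty remainder parses in one way */
  for (i = len; i-- > 0; )
      { ways[i] = 0;
        for (c = 0; c < ncodes; c++)
            { size_t clen = strlen(codes[c]);
              if (clen <= len - i && strncmp(bits + i, codes[c], clen) == 0)
                 ways[i] += ways[i + clen];
            }
      }
  return ways[0];
}
```

With the first set of codes, the sequence 00100100 splits in 2 ways ("ACBA"
and "AEE"); with the second set, 1001011 splits in exactly 1 way.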

 1      0
<-  /\  ->
   /  \
 "A"  /\
    "B" \
        /\
      "C" \
          /\
       "D"  "E"

As you see, with this tree, an encoding will have the prefix property
if the bytes are at the end of each "branch" and you have no byte at a "node".
You also see that if you reach a character by the right pointer you add
a bit set to 0 and by the left pointer, you add a bit set to 1 to the current
code. The previous *bad* encoding provides the following bad tree:

       /\
      /  \
     /    \
    /\    /\
   /  \ "B" "A"
  /    \
"D"  "C"\
      /  \
         "E"

You see here that the coder shouldn't put the "C" at a node...

As you see, the longest binary codes are those with the longest distance
from the top of the tree. Finally, the more frequent bytes will be the highest
in the tree so that they have the shortest encoding and the less frequent bytes
will be the lowest in the tree.

From an algorithmic point of view, you make a list of each byte you encountered
in the stream to compress. This list is always kept sorted. The zero-occurrence
bytes are removed from this list. You take the two bytes with the smallest
occurrences in the list. Whenever several bytes have the same "weight", you take
two of them regardless of their ASCII value. You join them in a node. This node
will have a fictive byte value (256 will be a good one!) and its weight will be
the sum of the weights of the two joined bytes. You then replace the two joined
bytes with the fictive byte. And you continue until only one byte (fictive or
not) remains in the list. Of course, this process will produce the shortest
codes only if the list remains sorted. I will not explain with arcane hard maths
why the result is a set of the shortest codes...

Important: I use as convention that the right sub-trees have a weight greater
or equal to the weight of the left sub-trees.

Example: Let's take a file to compress where we notice the following
occurrences:

Listed bytes | Frequencies (Weight)
-----------------------------------
      0      |        338
     255     |        300
      31     |        280
      77     |         24
     115     |         21
      83     |         20
     222     |         5

We will begin by joining the bytes 83 and 222. This will produce a fictive node
1 with a weight of 20+5=25.

(Fictive 1,25)
      /\
     /  \
(222,5) (83,20)

Listed bytes | Frequencies (Weight)
-----------------------------------
      0      |        338
     255     |        300
      31     |        280
  Fictive 1  |         25
      77     |         24
     115     |         21

Note that the list is sorted... The smallest values among the frequencies are 21
and 24. That is why we will take the bytes 77 and 115 to build the fictive
node 2.

(Fictive 2,45)
      /\
     /  \
(115,21) (77,24)

Listed bytes | Frequencies (Weight)
-----------------------------------
      0      |        338
     255     |        300
      31     |        280
  Fictive 2  |         45
  Fictive 1  |         25

The nodes with the smallest weights are the fictive nodes 1 and 2. These are
joined to build the fictive node 3 whose weight is 45+25=70.

        (Fictive 3,70)
             /   \
           /       \
         /           \
       /\            / \
     /   \          /    \
(222,5)  (83,20) (115,21) (77,24)

Listed bytes | Frequencies (Weight)
-----------------------------------
      0      |        338
     255     |        300
      31     |        280
  Fictive 3  |         70

The fictive node 3 is linked to the byte 31. Total weight: 280+70=350.

             (Fictive 4,350)
                   /   \
                 /       \
               /           \
             /  \       (31,280)
           /      \
         /          \
       /\            / \
     /   \          /    \
(222,5)  (83,20) (115,21) (77,24)

Listed bytes | Frequencies (Weight)
-----------------------------------
  Fictive 4  |        350
      0      |        338
     255     |        300

As you see, since we sort the list, the fictive node 4 has become the first
in the list. We join the bytes 0 and 255 in a single fictive node, number 5,
whose weight is 338+300=638.

(Fictive 5,638)
        /\
       /  \
(255,300) (0,338)

Listed bytes | Frequencies (Weight)
-----------------------------------
  Fictive 5  |        638
  Fictive 4  |        350

The fictive nodes 4 and 5 are finally joined. Final weight: 638+350=988.
It is actually the total number of bytes in the initial file:
338+300+280+24+21+20+5.

                         (Tree,988)
                       1     /  \     0
                      <-   /      \   ->
                         /          \
                       /              \
                     /                  \
                   /   \                / \
                 /       \             /    \
               /           \          /       \
             /  \       (31,280)  (255,300) (0,338)
           /      \
         /          \
       /\            / \
     /   \          /    \
(222,5)  (83,20) (115,21) (77,24)

Bytes | Huffman codes | Frequencies | Binary length*Frequency
-------------------------------------------------------------
  0   |       00b     |     338     |           676
 255  |       01b     |     300     |           600
  31  |       10b     |     280     |           560
  77  |      1100b    |      24     |            96
 115  |      1101b    |      21     |            84
  83  |      1110b    |      20     |            80
 222  |      1111b    |      5      |            20

Results: Original file size: (338+300+280+24+21+20+5)*8=7904 bits (=988 bytes)
versus 676+600+560+96+84+80+20=2116 bits, i.e. 2116/8=264.5, rounded up to
265 bytes.
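
The arithmetic of this example can be checked with a short sketch that joins
the two lightest nodes until a single one remains, then reads each code length
as the depth of the byte in the tree. It is not the code of codhuff.c: the name
huffman_lengths() is hypothetical, and a linear search for the two smallest
weights replaces the sorted list described above (the resulting code lengths
are the same for this example, where no tie has to be broken).

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SYMBOLS 256

/* Computes the code length of each of the n bytes (1 <= n <= MAX_SYMBOLS,
   all weights > 0) and returns the total size in bits, i.e. the sum of
   weight*length. With a single byte the length is 0. */
unsigned long huffman_lengths(const unsigned long *weight, size_t n,
                              unsigned *length)
{ unsigned long w[2 * MAX_SYMBOLS];  /* Weights of bytes, then fictive nodes */
  int parent[2 * MAX_SYMBOLS];       /* -1 while the node is still listed */
  size_t nodes = n, remaining = n, i;
  unsigned long total = 0;
  assert(n >= 1 && n <= MAX_SYMBOLS);
  for (i = 0; i < n; i++)
      { w[i] = weight[i];
        parent[i] = -1;
      }
  while (remaining > 1)
        { size_t a = nodes, b = nodes;   /* Two lightest listed nodes */
          for (i = 0; i < nodes; i++)
              { if (parent[i] != -1)
                   continue;
                if (a == nodes || w[i] < w[a])
                   { b = a;
                     a = i;
                   }
                else if (b == nodes || w[i] < w[b])
                        b = i;
              }
          w[nodes] = w[a] + w[b];        /* New fictive node */
          parent[nodes] = -1;
          parent[a] = parent[b] = (int)nodes;
          nodes++;
          remaining--;
        }
  for (i = 0; i < n; i++)
      { size_t p = i;
        unsigned depth = 0;
        while (parent[p] != -1)
              { p = (size_t)parent[p];
                depth++;
              }
        length[i] = depth;
        total += weight[i] * depth;
      }
  return total;
}
```

Called with the weights 338, 300, 280, 24, 21, 20 and 5, it yields lengths of
2 bits for the three frequent bytes, 4 bits for the four others, and a total of
2116 bits.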

Now you know how to code an input stream. The last problem is to decode all this
stuff. Actually, when you meet a binary sequence you can't say whether it comes
from such byte list or such other one. Furthermore, if you change the occurrence
of one or two bytes, you won't obtain the same resulting binary tree. Try for
example to encode the previous list but with the following occurrences:

Listed bytes | Frequencies (Weight)
----------------------------------
     255     |        418
      0      |        300
      31     |        100
      77     |         24
     115     |         21
      83     |         20
     222     |         5
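
To try this without drawing the trees by hand, here is a minimal sketch in C.
It is not taken from codhuff.c: the name huffman_lengths and the bound MAX_SYMS
are assumptions made for this illustration. It repeatedly joins the two
lightest nodes, as described above, and reports only the length of each code,
which is enough to compare the two trees.

```c
#define MAX_SYMS 257 /* assumed bound: 256 byte values + 1 fictive symbol */

/* Fills len[i] with the Huffman code length of symbol i and returns the
   total size of the encoded data in bits (0 if n is out of range). */
unsigned long huffman_lengths(const unsigned long *freq, int n, int *len)
{
    unsigned long weight[2 * MAX_SYMS];
    int parent[2 * MAX_SYMS];
    int alive[2 * MAX_SYMS];
    unsigned long total = 0;
    int nodes, remaining, i;

    if (n < 1 || n > MAX_SYMS)
        return 0;
    for (i = 0; i < n; i++) {
        weight[i] = freq[i];
        parent[i] = -1;
        alive[i] = 1;
    }
    nodes = n;
    for (remaining = n; remaining > 1; remaining--) {
        int a = -1, b = -1; /* a: lightest node, b: second lightest node */
        for (i = 0; i < nodes; i++) {
            if (!alive[i])
                continue;
            if (a < 0 || weight[i] < weight[a]) {
                b = a;
                a = i;
            } else if (b < 0 || weight[i] < weight[b]) {
                b = i;
            }
        }
        /* Join a and b into a new fictive node. */
        weight[nodes] = weight[a] + weight[b];
        parent[nodes] = -1;
        alive[nodes] = 1;
        parent[a] = parent[b] = nodes;
        alive[a] = alive[b] = 0;
        nodes++;
    }
    /* The code length of a byte is its depth in the tree. */
    for (i = 0; i < n; i++) {
        int depth = 0, p = i;
        while (parent[p] >= 0) {
            p = parent[p];
            depth++;
        }
        len[i] = depth;
        total += (unsigned long)depth * freq[i];
    }
    return total;
}
```

With the first list (338, 300, 280, 24, 21, 20, 5) this sketch gives the
lengths 2, 2, 2, 4, 4, 4, 4 and a total of 2116 bits. With the list just above
(418, 300, 100, 24, 21, 20, 5) it gives the lengths 1, 2, 3, 5, 5, 5, 5.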

As you can observe, the resulting binary tree is quite different, although we
had the same initial bytes. To avoid this situation, we put a header in front
of all the data. I can't comment on this header at length, but I can say that
I made it as small as I could. The header is divided into two parts.
The first part looks much like a boolean table (coded more or less in binary
to save space), and the second part provides the decoder with the binary code
associated with each byte encountered in the original input stream.

Here is a summary of the header:

First part
----------
                        First bit
                          /  \
                      1 /      \ 0
                      /          \
  256 bits set to 0 or 1        5 bits for the number n (minus 1)
  depending on whether the      of bytes encountered
  corresponding byte was        in the file to compress
  in the file to compress                   |
  (=> n bits set to 1,                     \ /
   n>32)                        n values of 8-bits (n<=32)
                     \           /
                       \       /
                         \   /
Second part                |
-----------                |
                           |
            +------------->|
(n+1) times |              |
(n bytes of |          First bit?
the values  |            /   \
encountered |         1 /      \ 0
in the      |          /        \
source file |   8 bits of      5 bits of the
+ the code  | the length       length (-1)
of a        | (-1) of the      of the following
fictive     | following        binary
byte        | binary code      code
to stop the | (length>32)      (length<=32)
decoding.   |          \       /
The fictive |           \     /
is set to   |            \   /
256 in the  |              |
Huffman     |         binary code
-tree of    |              |
encoding)   +--------------|
                           |
            Binary encoding of the source file
                           |
                  Code of end of encoding
                           |
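
The cost of this header can be estimated from the diagram alone. The sketch
below is only an illustration in C (the function names are assumptions, they
do not come from codhuff.c): it counts the bits of the first part for n
different bytes, and the bits of one entry of the second part for a binary
code of a given length.

```c
/* Size in bits of the first part of the header for n different bytes
   (1 <= n <= 256), following the diagram above. */
unsigned int header_first_part_bits(unsigned int n)
{
    if (n <= 32)
        return 1 + 5 + 8 * n; /* bit 0, 5 bits for n-1, n values of 8 bits */
    return 1 + 256;           /* bit 1, then the table of 256 bits */
}

/* Size in bits of one entry of the second part for a binary code of
   `length` bits (1 <= length <= 256). */
unsigned int header_code_entry_bits(unsigned int length)
{
    if (length <= 32)
        return 1 + 5 + length; /* bit 0, 5 bits for length-1, the code */
    return 1 + 8 + length;     /* bit 1, 8 bits for length-1, the code */
}
```

For the 7 different bytes of the example, the first part takes 1+5+7*8=62 bits.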


With my codecs I can handle binary codes with a length of up to 256 bits.
This is enough to encode any input stream, from one byte to a practically
unlimited length. In fact, if a byte had a range from 0 to 257 instead of
0 to 255, I would have a bug with my codecs only with an input stream of at
least 370,959,230,771,131,880,927,
453,318,055,001,997,489,772,178,180,790,105 bytes !!!

Where does this explosive number come from? In fact, to have a severe bug,
I must have a completely unbalanced tree:

   Tree
    /\
      \
      /\
        \
        /\
          \
          ...
           /\
             \
             /\

Let's take the following example:

Listed bytes | Frequencies (Weight)
----------------------------------
      32     |         5
      101    |         3
      97     |         2
      100    |         1
      115    |         1

This produces the following unbalanced tree:

    Tree
     /\
(32,5) \
       /\
 (101,3) \
         /\
   (97,2)  \
           /\
    (115,1)  (100,1)

Let's speak about a mathematical series: the Fibonacci series. It is defined as
follows:

{ Fib(0)=0
{ Fib(1)=1
{ Fib(n)=Fib(n-2)+Fib(n-1)

Fib(0)=0, Fib(1)=1, Fib(2)=1, Fib(3)=2, Fib(4)=3, Fib(5)=5, Fib(6)=8, Fib(7)=13,
etc.

But 1, 1, 2, 3, 5 are the occurrences of our list! We can actually
demonstrate that, to have a completely unbalanced tree, we have to take a list
whose occurrences follow the Fibonacci series (these values are minimal).
If the data to compress have m different bytes, then when the tree is
unbalanced the longest code needs m-1 bits. In our little previous example
where m=5, the longest codes are associated with the bytes 115 and 100,
respectively coded 0001b and 0000b. We can also say that to have an unbalanced
tree we must have at least 5+3+2+1+1=12=Fib(7)-1 bytes. To conclude about all
that, with a coder that uses codes of at most m-1 bits, you must never have an
input stream size over Fib(m+2)-1, otherwise there could be a bug in the output
stream. Of course, with my codecs there will never be a bug because I can deal
with binary code sizes of 1 to 256 bits. Some encoders could use that with
m=31, Fib(31+2)-1=3,524,577 and m=32, Fib(32+2)-1=5,702,886. And an encoder
that uses unsigned integers of 32 bits shouldn't have a bug until about 4 Gb...
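
The bound Fib(m+2)-1 is easy to compute. Here is a small sketch in C (the
function name is an assumption made for this illustration) that follows the
definition of the Fibonacci series given above.

```c
/* Returns Fib(m+2)-1, the smallest input size (in bytes) for which m
   different bytes can produce a completely unbalanced tree.
   Valid for m <= 90; beyond that an unsigned long long may overflow. */
unsigned long long min_unbalanced_size(unsigned int m)
{
    unsigned long long a = 0; /* Fib(0) */
    unsigned long long b = 1; /* Fib(1) */
    unsigned int i;

    for (i = 0; i < m + 2; i++) {
        unsigned long long next = a + b;
        a = b;
        b = next;
    }
    /* After m+2 steps, a holds Fib(m+2). */
    return a - 1;
}
```

With m=5 this gives 12, with m=31 it gives 3,524,577 and with m=32 it gives
5,702,886, as stated above.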

+==========================================================+
|                     The LZW encoding                     |
+==========================================================+

The LZW scheme is due to three researchers: Abraham Lempel and Jacob Ziv
worked on it in 1977, and Terry Welch completed the scheme in 1984.

LZW is patented in the USA. This patent, number 4,558,302, is held by Unisys
Corporation. You can usually write (without fees) software codecs which use
the LZW scheme, but hardware companies can't do so. You may get a limited
licence by writing to:
Welch Licencing Department
Office of the General Counsel
M/S C1SW19
Unisys corporation
Blue Bell
Pennsylvania, 19424 (USA)

If you are a Westerner, you surely use something like an LZW encoding every
time you speak, and especially when you use a dictionary. Let's consider, for
example, the word "Cirrus". When you read a dictionary, you begin with "A",
"Aa", and so on. But a computer has no experience and it must assume that some
words already exist. That is why, with "Cirrus", it assumes that "C", "Ci",
"Cir", "Cirr", "Cirru", and "Cirrus" exist. Of course, since this is a computer,
all these words are encoded as index numbers. Every time you go forward, you add
a new number associated with the new word. Since a computer is byte-based
and not alphabet-based, you have an initial dictionary of 256 letters instead
of our 26 letters ('A' to 'Z').

Example: Let's code "XYXYZ". First step, "X" is recognized in the initial
dictionary of 256 letters as the 89th entry (code 88). Second step, "Y" is
read. Does "XY" exist? No, so "XY" is stored as the word 256 and you write to
the output stream the ASCII code of "X", i.e. 88. Next, "YX" is tested and
found not to be in the current dictionary. It is stored as the word 257 and you
write 89 (ASCII code of "Y") to the output stream. "XY" is now met again, but
this time "XY" is known as the word 256. Since "XY" exists, you test the
sequence with one more letter, i.e. "XYZ". This last word is not in the current
dictionary, so you write the value 256 and add "XYZ" as the word 258. Finally,
you reach the last letter ("Z"). Nothing follows it, which is why you just
write the value 90 (ASCII code of "Z"). The output stream is therefore
88, 89, 256, 90.
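
The steps of this example can be sketched in C as follows. This is only an
illustration and not one of the codecs: the names (lzw_encode, lzw_find,
LZW_MAX_WORDS, LZW_FIRST_NEW) are assumptions, the dictionary is searched
linearly to keep the code short, the first new word is numbered 256 as in the
example, and the codes are stored in an array of int instead of being packed
into bits.

```c
#include <stddef.h>

#define LZW_MAX_WORDS 4096 /* assumed dictionary limit */
#define LZW_FIRST_NEW 256  /* first new word, as in the example above */

/* Word i (i >= 256) is the word enc_prefix[i] followed by enc_suffix[i]. */
static int enc_prefix[LZW_MAX_WORDS];
static unsigned char enc_suffix[LZW_MAX_WORDS];

/* Returns the code of the word "prefix + c", or -1 if it is not stored. */
static int lzw_find(int count, int prefix, unsigned char c)
{
    int i;
    for (i = LZW_FIRST_NEW; i < count; i++)
        if (enc_prefix[i] == prefix && enc_suffix[i] == c)
            return i;
    return -1;
}

/* Encodes in[0..len-1] and returns the number of codes written to out.
   out must have room for len codes. */
size_t lzw_encode(const unsigned char *in, size_t len, int *out)
{
    int count = LZW_FIRST_NEW; /* number of the next new word */
    size_t n = 0, i;
    int w;                     /* code of the word recognized so far */

    if (len == 0)
        return 0;
    w = in[0];
    for (i = 1; i < len; i++) {
        int code = lzw_find(count, w, in[i]);
        if (code >= 0) {
            w = code;          /* the longer word exists: keep reading */
        } else {
            out[n++] = w;      /* write the known word */
            if (count < LZW_MAX_WORDS) {
                enc_prefix[count] = w; /* store the new word */
                enc_suffix[count] = in[i];
                count++;
            }
            w = in[i];         /* restart from the last letter read */
        }
    }
    out[n++] = w;              /* last word: nothing follows it */
    return n;
}
```

Called on the 5 letters of "XYXYZ", this sketch writes the 4 codes
88, 89, 256, 90.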

Another encoding sample with the string "ABADABCCCABCEABCECCA".

+----+-----+---------------+------+----------+-------------------------+------+
|Step|Input|Dictionary test|Prefix|New symbol|Dictionary               |Output|
|    |     |               |      |          |D0=ASCII with 256 letters|      |
+----+-----+---------------+------+----------+-------------------------+------+
|  1 | "A" |"A" in D0      | "A"  |    "B"   | D1=D0                   |  65  |
|    | "B" |"AB" not in D0 |      |          | and "AB"=256            |      |
+----+-----+---------------+------+----------+-------------------------+------+
|  2 | "A" |"B" in D1      | "B"  |    "A"   | D2=D1                   |  66  |
|    |     |"BA" not in D1 |      |          | and "BA"=257            |      |
+----+-----+---------------+------+----------+-------------------------+------+
|  3 | "D" |"A" in D2      | "A"  |    "D"   | D3=D2                   |  65  |
|    |     |"AD" not in D2 |      |          | and "AD"=258            |      |
+----+-----+---------------+------+----------+-------------------------+------+
|  4 | "A" |"D" in D3      | "D"  |    "A"   | D4=D3                   |  68  |
|    |     |"DA" not in D3 |      |          | and "DA"=259            |      |
+----+-----+---------------+------+----------+-------------------------+------+
|  5 | "B" |"A" in D4      | "AB" |    "C"   | D5=D4                   |  256 |
|    | "C" |"AB" in D4     |      |          | and "ABC"=260           |      |
|    |     |"ABC" not in D4|      |          |                         |      |
+----+-----+---------------+------+----------+-------------------------+------+
|  6 | "C" |"C" in D5      | "C"  |    "C"   | D6=D5                   |  67  |
|    |     |"CC" not in D5 |      |          | and "CC"=261            |      |
+----+-----+---------------+------+----------+-------------------------+------+
|  7 | "C" |"C" in D6      | "CC" |    "A"   | D7=D6                   |  261 |
|    | "A" |"CC" in D6     |      |          | and "CCA"=262           |      |
|    |     |"CCA" not in D6|      |          |                         |      |
+----+-----+---------------+------+----------+-------------------------+------+
|  8 | "B" |"A" in D7      | "ABC"|    "E"   | D8=D7                   |  260 |
|    | "C" |"AB" in D7     |      |          | and "ABCE"=263          |      |
|    | "E" |"ABC" in D7    |      |          |                         |      |
|    |    <"ABCE" not in D7|      |          |                         |      |
+----+-----+---------------+------+----------+-------------------------+------+
|  9 | "A" |"E" in D8      | "E"  |    "A"   | D9=D8                   |  69  |
|    |     |"EA" not in D8 |      |          | and "EA"=264            |      |
+----+-----+---------------+------+----------+-------------------------+------+
| 10 | "B" |"A" in D9      |"ABCE"|    "C"   | D10=D9                  |  263 |
|    | "C" |"AB" in D9     |      |          | and "ABCEC"=265         |      |
|    | "E" |"ABC" in D9    |      |          |                         |      |
|    | "C" |"ABCE" in D9   |      |          |                         |      |
|    |    <"ABCEC" not in D9>     |          |                         |      |
+----+-----+---------------+------+----------+-------------------------+------+
| 11 | "C" |"C" in D10     | "CCA"|          |                         |  262 |
|    | "A" |"CC" in D10    |      |          |                         |      |
|    |     |"CCA" in D10   |      |          |                         |      |
+----+-----+---------------+------+----------+-------------------------+------+

You will notice a problem with the above output: how do you write a code of 256
(for example) on 8 bits? This problem is simple to solve. You just say that
the encoding starts with 9 bits and, as you reach the 512th word, you use a
10-bit encoding. With 1024 words, you use 11 bits; with 2048 words, 12 bits;
and so on for every power of two 2^n (n is positive). To better synchronize
the coder and the decoder, most implementations use two additional references.
The word 256 is a reinitialization code (the codec must completely reinitialize
the current dictionary to its 256 initial letters) and the word 257 is an
end-of-information code (no more data to read).
Of course, your first new word then starts at the code number 258.
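
The rule for the size of the codes can be sketched in C like this. The
function name and its parameters are assumptions made for this illustration.
The exact moment of the switch is a convention that the coder and the decoder
must share; this sketch assumes the switch happens as soon as the dictionary
holds 2^n words.

```c
/* Number of bits used to write one code when the dictionary holds `words`
   words, for an encoding that starts with `start_bits` bits per code and
   never uses more than `max_bits` bits (max_bits <= 31). */
unsigned int lzw_code_bits(unsigned int words, unsigned int start_bits,
                           unsigned int max_bits)
{
    unsigned int bits = start_bits;

    /* One more bit each time the number of words reaches a power of two. */
    while (bits < max_bits && (1u << bits) <= words)
        bits++;
    return bits;
}
```

For example, lzw_code_bits(258, 9, 12) is 9, lzw_code_bits(512, 9, 12) is 10
and lzw_code_bits(2048, 9, 12) is 12.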

You can also do as the GIF file format does and start with an initial
dictionary of 18 words to code an input stream whose letters are coded on only
4 bits (you then start with codes of 5 bits in the output stream!). The 18
initial words are: 0 to 15 (initial letters), 16 (reinit the dictionary), and
17 (end of information). The first new word has code 18, the second word has
code 19, ...

Important: You can consider that your dictionary is limited to 4096 different
words (as in the GIF and TIFF file formats). But if your dictionary is full, you
can decide to keep sending existing codes *without* reinitializing the
dictionary. All decoders must comply with this. It lets you decide that
reinitializing the full dictionary is not efficient: instead, you leave the
dictionary unchanged and you send/receive (depending on whether it is a coder
or a decoder) existing codes of the full dictionary.

My codecs can deal with most sizes of the initial dictionary as well as with
a full dictionary.

Let's see how to decode an LZW encoding. We saw with the true dynamic Huffman
scheme that you needed a header in front of the encoded data. No header is
needed in the LZW scheme. When two successive codes are read, the first one
must exist in the dictionary. This code can be immediately decoded and written
to the output stream. If the second code is less than or equal to the number
of the last word in the current dictionary, this code is decoded like the
first one. On the other hand, if the second code is equal to the number of the
last word in the dictionary plus one, you have to write a word composed of the
word (the sentence, not the code number) of the first code plus the first
character of that same word. In the meantime, you add a new word to the
dictionary. This new word is composed of all the letters of the word associated
with the first code plus the first letter of the word of the second code (in
the second case, it is the very word you just sent to the output stream). You
continue the processing with the second and third codes read in the input
stream (of codes)...

Example: Let's decode the previous encoding given a bit more above.

+------+-------+----------------+----------+------------------+--------+
| Step | Input | Code to decode | New code |    Dictionary    | Output |
+------+-------+----------------+----------+------------------+--------+
|   1  |   65  |       65       |    66    |     65,66=256    |   "A"  |
|      |   66  |                |          |                  |        |
+------+-------+----------------+----------+------------------+--------+
|   2  |   65  |       66       |    65    |     66,65=257    |   "B"  |
+------+-------+----------------+----------+------------------+--------+
|   3  |   68  |       65       |    68    |     65,68=258    |   "A"  |
+------+-------+----------------+----------+------------------+--------+
|   4  |  256  |       68       |    256   |     68,65=259    |   "D"  |
+------+-------+----------------+----------+------------------+--------+
|   5  |   67  |       256      |    67    |   65,66,67=260   |   "AB" |
+------+-------+----------------+----------+------------------+--------+
|   6  |  261  |       67       |    261   |     67,67=261    |   "C"  |
+------+-------+----------------+----------+------------------+--------+
|   7  |  260  |       261      |    260   |   67,67,65=262   |   "CC" |
+------+-------+----------------+----------+------------------+--------+
|   8  |   69  |       260      |    69    |  65,66,67,69=263 |  "ABC" |
+------+-------+----------------+----------+------------------+--------+
|   9  |  263  |       69       |    263   |     69,65=264    |   "E"  |
+------+-------+----------------+----------+------------------+--------+
|  10  |  262  |       263      |    262   |65,66,67,69,67=265| "ABCE" |
+------+-------+----------------+----------+------------------+--------+
|  11  |       |       262      |          |                  |  "CCA" |
+------+-------+----------------+----------+------------------+--------+

Summary: Step 4 is a typical example. The code to decode is 68 ("D" in
ASCII) and the new code is 256. The new word to add to the dictionary is made
of the letters of the first word plus the first letter of the second code
(code 256, i.e. "AB"), that is 68 ("D") plus 65 ("A" in ASCII). So the new word
has the letters 68 and 65 ("DA").

Step 6 is quite special. The first code to decode is referenced, but the
second (new) code, 261, is not, since at this point the dictionary only holds
words up to number 260. We have to handle it as the second case given
previously: you must take the word to decode plus its first letter, i.e.
"C"+"C"="CC". Be careful: if any encountered code is *greater* than the number
of the last word in the dictionary plus 1, it means you have a problem in your
data and/or your codecs are...bad!
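
The two cases of the decoding can be sketched in C as follows. Again this is
only an illustration with assumed names (lzw_decode, lzw_expand), not one of
the codecs: the codes are read from an array of int, the first new word is
numbered 256 as in the example, and `count` is the number of the next new
word, i.e. the number of the last word in the dictionary plus one.

```c
#include <stddef.h>

#define LZW_MAX_WORDS 4096 /* assumed dictionary limit */
#define LZW_FIRST_NEW 256  /* first new word, as in the example above */

/* Word i (i >= 256) is the word dec_prefix[i] followed by dec_suffix[i]. */
static int dec_prefix[LZW_MAX_WORDS];
static unsigned char dec_suffix[LZW_MAX_WORDS];

/* Writes the letters of the word `code` to dst and returns their number. */
static size_t lzw_expand(int code, unsigned char *dst)
{
    unsigned char stack[LZW_MAX_WORDS];
    size_t depth = 0, i;

    while (code >= LZW_FIRST_NEW) { /* letters come out last one first */
        stack[depth++] = dec_suffix[code];
        code = dec_prefix[code];
    }
    stack[depth++] = (unsigned char)code;
    for (i = 0; i < depth; i++)
        dst[i] = stack[depth - 1 - i];
    return depth;
}

/* Decodes ncodes codes into out (which must be large enough). Returns the
   number of bytes written, or (size_t)-1 if a code is invalid. */
size_t lzw_decode(const int *codes, size_t ncodes, unsigned char *out)
{
    int count = LZW_FIRST_NEW;
    size_t pos, i;
    int prev;

    if (ncodes == 0)
        return 0;
    prev = codes[0];
    if (prev < 0 || prev >= LZW_FIRST_NEW)
        return (size_t)-1;
    pos = lzw_expand(prev, out);
    for (i = 1; i < ncodes; i++) {
        int code = codes[i];
        size_t len;

        if (code < 0 || code > count || code >= LZW_MAX_WORDS)
            return (size_t)-1;      /* bad data (or bad codec!) */
        if (code < count) {         /* first case: the word is known */
            len = lzw_expand(code, out + pos);
        } else {                    /* second case: previous word plus
                                       its own first letter */
            len = lzw_expand(prev, out + pos);
            out[pos + len] = out[pos];
            len++;
        }
        if (count < LZW_MAX_WORDS) { /* new word: previous word plus the
                                        first letter of the current one */
            dec_prefix[count] = prev;
            dec_suffix[count] = out[pos];
            count++;
        }
        pos += len;
        prev = code;
    }
    return pos;
}
```

Given the 11 codes 65, 66, 65, 68, 256, 67, 261, 260, 69, 263, 262 of the
table above, this sketch gives back the 20 letters of "ABADABCCCABCEABCECCA".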

Tricks to improve LZW encoding (but it becomes a non-standard encoding):
- Limit the dictionary to a high number of words (a maximum of 4096 words
enables you to encode a stream of at most 7,370,880 letters with the same
dictionary)
- Use a dictionary of fewer than 258 initial words if possible (for example,
with 16-color pictures, you start with a dictionary of 18 words)
- Do not reinitialize a dictionary when it is full
- Reinitialize a dictionary with the most frequent words of the previous
dictionary
- Use the codes from (current dictionary size+1) to (maximum dictionary size),
because these codes are not used in the standard LZW scheme.
Such a compression scheme has been used (successfully) by Robin Watts
<ct93008@ox.ac.uk>.
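
The figure of 7,370,880 letters given in the first trick can be reproduced
under one assumption of mine (the text does not state how it was obtained):
the best case is a stream where each code written is one letter longer than
the previous one, so that with 258 initial words and 4096 words at most the
longest word has 4096-258+1=3839 letters and the stream has 1+2+...+3839
letters. The function name below is hypothetical.

```c
/* Longest stream (in letters) that fills a dictionary of max_words words
   exactly once, starting from initial_words words, assuming each code
   written is one letter longer than the previous one. */
unsigned long lzw_max_letters(unsigned long max_words,
                              unsigned long initial_words)
{
    /* Length of the longest word that can be stored. */
    unsigned long longest = max_words - initial_words + 1;

    return longest * (longest + 1) / 2; /* 1 + 2 + ... + longest */
}
```

Under this assumption, lzw_max_letters(4096, 258) is 7,370,880.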

+==========================================================+
|                         Summary                          |
+==========================================================+

-------------------------------------------------
RLE type 1:
Fastest compression. Good ratio for general purposes.
Doesn't need to read the data twice.
Decoding fast.
-------------------------------------------------
RLE type 2:
Fast compression. Very good ratio in general (even for general purposes).
Needs to read the data twice.
Decoding fast.
-------------------------------------------------
RLE type 3:
Slowest compression. Good ratio on image files, fairly average for general
purposes.
Needs to read the data twice.
Change line:
#define MAX_RASTER_SIZE 256
into:
#define MAX_RASTER_SIZE 16
to speed up the encoding (but the compression ratio decreases). If you compress
with memory buffers, do not modify this line...
Decoding fast.
-------------------------------------------------
RLE type 4:
Slow compression. Good ratio on image files, average for general purposes.
Change line:
#define MAX_RASTER_SIZE 66
into:
#define MAX_RASTER_SIZE 16
to speed up the encoding (but the compression ratio decreases). If you compress
with memory buffers, do not modify this line...
Decoding fast.
-------------------------------------------------
Huffman:
Fast compression. Good ratio on text files and similar, average for general
purposes. Interesting method to compress a buffer already compressed by
the RLE type 1 or 2 methods...
Decoding fast.
-------------------------------------------------
LZW:
Quite fast compression. Good, even very good, ratio for general purposes.
The bigger the data, the better the compression ratio.
Decoding quite fast.
-------------------------------------------------

The source codes work on all kinds of computers with a C compiler.
With the compiler, choose the option that optimizes for speed rather than for
space. On UNIX systems, it is better to compile them with the option -O.
If you don't use a GNU compiler, the source file MUST NOT have a size
over 4 Gb for RLE 2, 3, and Huffman, because I count the number
of occurrences of the bytes.
Indeed, with GNU compilers, 'unsigned long int' is 8 bytes instead of 4 bytes
(as with normal C UNIX compilers and PC compilers, such as Microsoft C++
and Borland C++).
Actually:
* Normal UNIX compilers,		=> 4 Gb (unsigned long int = 4 bytes)
  Microsoft C++ and Borland C++ for PCs
* GNU UNIX compilers			=> 17179869184 Gb (unsigned long int = 8 bytes)
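
To know which of the two cases applies to your own compiler, you can check the
size of 'unsigned long int' directly. This is a small sketch in C using only
<limits.h>; the function names are assumptions made for this illustration.

```c
#include <limits.h>

/* Number of bits of the 'unsigned long int' used to count occurrences. */
unsigned int counter_bits(void)
{
    return (unsigned int)(sizeof(unsigned long int) * CHAR_BIT);
}

/* Largest count such a counter can reach, rounded up and expressed in
   units of 2^30 bytes: 4 for a 4-byte counter, 17179869184 for 8 bytes. */
unsigned long int counter_limit_gb(void)
{
    return (ULONG_MAX >> 30) + 1UL;
}
```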

+==========================================================+
|                             END                          |
+==========================================================+