Example Huffman Compression Routine in C#


This last week I decided to sit down and hash out a simple Huffman compression routine using C#. I’d never created a compression routine before from scratch (my past implementations were static for the sake of time savings), so I fleshed one out. I know that many examples exist elsewhere on the net…. but they all seemed overly complicated and up their own ass :P

I had a couple goals in mind while creating my routine:

1. KEEP IT SIMPLE — A lot of routines out there WORK, but their code is too overly complicated for their own good. This over complication leads to slowness which brings me to my next goal ;) It should be a simple class that accepts input data, with simple public accessors that are easy to understand even for the novice developer (sorry folks, no asynchronous delegates). :P

2. MAKE IT FAST — When dealing with large amounts of data in C#, especially when running it through an algorithm, it’s all too easy to use all the handy built in virtual methods or using other build in tools which make coding easier with speed being the sacrifice. Die hard C++ developers will point to these routines as C#’s downfall as a legitimate language when it comes to data intensive tasks.

The class I came up with is pretty simple. I use a Generic List to store a collection of “Leaf” objects, which have several basic attributes that help not only identify its value but also its place in the tree. Using this method, I was able to utilize the built in methods of the List object (I know, for shame…. but it’s easier in this instance) by firing off anonymous delegates for searches and comparators. I’m not terribly worried about using these virtual methods here only because the creation and encoding of the tree is usually the smallest task in the process ;)

The encoding and decoding is where I decided to focus on optimizations since this is where the BULK of the work is done. The .NET Framework has several methods that make working with binary data easy. You can use the Convert.ToString() method which allows you to pass in a BASE option, thus allowing you to convert any character to it’s binary representation. My original implementation used that method and the end result as embarrassingly slow :P

I went back to the drawing board and thought to myself, “If I had to re-write this in C++, how would I handle the encoding?” Duh, I’d be using bitwise operators up the wazoo! :)

After some recoding and pulling my hair out for a couple of hours, I was able to re-write the routine using bit operations and it works! On top of all that, it’s fast as all get out! My current benchmarks had it encoding a 1MB data set in under 1 second with ~50% compression. Not too shabby! Of course, compression ratios will vary depending on how normalized the input data set is.

Now I’m sure there’s some question you have on your mind and I’ll try to address them now:

Q: Does it use a lot of memory?

A: You bet your sweet ass it does! ;) Seriously though, it’s only the method in which I setup the class that requires the memory. I establish an input buffer within the class that you can write the ‘raw’ data to, which is then read from during encoding. In addition, during encoding I create an output buffer in memory where the ‘encoded’ data is written. So it stands to reason that if you’re encoding 100MB of data, this routine can easily gobble up 200MB of RAM or more. The rule of thumb I found was File Size * 4 would be the memory requirement. There’s optimizations you can make that would lower the memory footprint (like, build the frequency table without buffering then read the input 1 byte at a time, say from a file), but I felt that would over complicate the solution and make it too focused for one specific instance. The current implementation is kept general for a reason :)

Q: Could this be done faster in C++?

A: Yes, probably…. but not much faster. Although the code is written in C#, at run time the IL is compiled to x86. The bit operations we’re using would compile the exact same as a C++ routine (XOR is XOR, I don’t care what language you’re using). In addition, the encoding of the data itself is only using primitive native types which limits any cross language differences. In fact, you can paste the encode and decode routines into C++ and they work! :) Your only speed improvement might be in the generation of the tree itself… but even then, that’s super small overhead when compared with the amount of data you’re probably compressing.

Of course, all that applied to the Encoding (Compression) side of the house, the decompression routine is pretty slow (about 3 seconds per 1MB of decompressed data) and could probably use some more optimizations.

Q: Is this any better than using the built in GZip or Deflate classes available in System.IO.Compression?

A: It’s not even close to being in competition with those routines :P This is really just a proof of concept for BASIC Huffman Coding, which doesn’t take into account advanced features of modern compression routines such as pattern or content mapping. This routine is slower (especially in decompression), so I wouldn’t go making anything like this your #1 choice for a compression routine if others are available ;) So use this for educational purposes only.

Q: What version of the .NET Framework will this work with?

A: The code here was written in Visual Studio 2008 targeting .NET 3.5. I use some Framework 3.5 specific things (such as object initializers), but nothing that would make conversion difficult. I avoided LINQ only because I’m not entirely sold on the idea and I still like using anonymous delegates. ;) This code could be converted to Framework 2.0 with minor changes and possibly Framework 1.1, but that might require a little more effort.

Q: What is the format used to store the compressed data?

A: I encode the decompression information within the final output stream. The output format is like this:

Bytes 0 – 8: Final Output Size (Not used, but there as a checksum if needed in the future)

Byte 9: Number of Bytes in the Decode Dictionary

Bytes 10 – n: Decode Dictionary

Between the Decode Dictionary and the actual data I add the characters “BCD” (which stands for Binary Coded Data). This lets me know where the dictionary ends and the actual coded data begins. It helped during debugging and I figure it’d help anyone else out there as well while working with this routine :)

Q: What’s with the essay? Just give me the code!

A: Fine! ;) Seriously though, the only reason I’m doing such a long write-up on the code is to help people who are perhaps beginning to look into this sort of code for the first time and might have questions on why I did things a certain way. Understanding WHY the code was written helps understand how it operates.

So that’s the high and low of it! :)

Hope this helps someone out and if you have any questions, please feel free to leave a comment!

Cheers! :)

Huffman.zipDownload (3k)

, , , , ,

  1. #1 by Chris on July 2, 2009 - 1:09 PM

    Do you have any interest in modifying this such that I can use it to generate a dictionary (which can be persisted to disk) and use that dictionary for all future compression/decompression. It guess for Huffman, it’s a tree and not a dictionary. If not, could you give me some tips on what changes would be necessary? Thanks.

  2. #2 by Chris on July 2, 2009 - 1:12 PM

    Also, since the tree would be stored separately, I would not want it to be a part of the output stream.

  3. #3 by Chris on July 3, 2009 - 8:54 AM

    Why are you calling BuildTree from so many places? Looks redundant.

  4. #4 by Chris on July 3, 2009 - 9:15 AM

    Decode() should be static.

  5. #5 by Chris on July 3, 2009 - 11:46 AM

    I made the changes that I needed and emailed them to you. I’m wondering if the algorithm could be adapted to use a 16-bit tree instead of an 8-bit tree. Since I’m storing the tree separately now, I can afford a much larger tree if it means better compression.

  6. #6 by Chris on July 5, 2009 - 9:00 AM

    I was able to get the 16-bit version working. I had to do some optimizing in BuildTree, EncodeTree, and Decode. It does in fact yield better compression.

  7. #7 by Chris on July 5, 2009 - 9:01 AM

    Decode is around 100x faster now.

  8. #8 by Chris on July 6, 2009 - 7:57 PM

    I’ve made lots more changes since emailing you. 7.4 seconds for an encode/decode cycle on 100MB of data that is already in memory. Strangely, the 16-bit version runs even faster and gives better compression. But it’s still nothing compared to the big guys in terms of speed of compression.

  9. #9 by Chris on July 6, 2009 - 8:01 PM

    I meant “speed OR compression”

(will not be published)

Powered by WP Hashcash